Text File | 1993-03-01 | 224.5 KB | 3,715 lines |
-
-
-
- WHAT YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS
- ***************************************************************
-
-
-
- This document has been created to provide the net.community
- with some detailed information about mathematical coprocessors
- for the Intel 80x86 CPU family. It may also help to answer
- some of the FAQs (frequently asked questions) about this topic.
- The focus of this document is on 387 compatible chips, but
- there is also some information on the other chips in the 80x87
- family and the Weitek coprocessors. Care was taken to make the
- information included as accurate as possible. If you think you
- have discovered erroneous information in this text, or think
- that a certain detail needs to be clarified, or want to suggest
- additions to this text, feel free to contact me at:
-
- S_JUFFA@IRAVCL.IRA.UKA.DE
-
- or at my snail mail address:
-
- Norbert Juffa
- Wielandtstr. 14
- 7500 Karlsruhe 1
- Germany
-
-
- CONTENTS of this document
-
- 1) What are math coprocessors?
- 2) What applications benefit from using a math coprocessor?
- 3) Installing a math coprocessor
- 4) Description of available math coprocessors, special features,
- available speeds, packaging, power consumption
- 5) Price information
- 6) How do math coprocessors work?
- 7) Performance comparison of math coprocessors
- 8) Test for IEEE-754 conformance and accuracy of transcendental
- functions for different math coprocessors
- 9) References (literature)
- 10) Addresses of manufacturers of math coprocessors
- 11) Appendix A: Test programs for partial compatibility checks
- 12) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP
-
- What are math coprocessors?
-
- A coprocessor in the traditional sense is a processor that extends
- the capabilities of a CPU in a transparent manner. This means that
- from the programmer's view the CPU and coprocessor together look
- like one machine. The 80x87 math coprocessors are typical examples
- of such coprocessors. The 80x86 CPUs (with the exception of the 80486,
- which has a built-in 'coprocessor') can only handle 8, 16, or 32 bit
- integers as their primary data types. However, many applications
- require the use of floating-point numbers. Simply put, use of floating
- point numbers enables one to express not only integers, but also
- fractional values over a wide range. The most common application
- of floating point numbers is in scientific applications, where very
- small (e.g. Planck's constant) and very large numbers (e.g. speed
- of light) have to be expressed. But floating-point numbers are also
- useful for business applications such as computing interest. Since
- the 80x86 CPUs do not support floating-point numbers or operations
- on them directly, they have to be programmed using the CPU's integer
- capabilities. This results in slow computations when floating-point
- numbers are used. This is where the 80x87 coprocessors come in.
- Adding an 80x87 to an 80x86 based system augments the CPU architecture
- with eight floating point registers, five additional data types and
- over 70 additional mnemonics. This greatly enhances the system's
- capability to do floating-point computations, as the coprocessor is
- specifically designed to handle floating-point numbers efficiently.
- Like most things in life, floating-point arithmetic has been
- standardized. The relevant standard, to which I will refer quite
- often in this document, is IEEE-754 Standard for Binary Floating-Point
- Arithmetic [10,11]. The standard specifies numeric formats, value
- sets and how the basic arithmetic (+,-,*,/,sqrt, remainder) has to
- work. All the coprocessors covered in this document claim full or
- at least partial compliance with this standard. When browsing the
- literature for information on math coprocessors, you will also
- encounter quite a few acronyms that refer to them: MCP (Math
- CoProcessor), NDP (Numerical Data Processor), NPX (Numerical
- Processor eXtension), FPU (Floating Point Unit). The latter usually
- refers to the 'built-in coprocessor' of the i486.
-
-
- The only data type the 80x87 coprocessors (and the 80486 floating
- point unit, or FPU) can hold in their registers is an 80-bit long
- floating-point number. This data type (called temporary real or
- double extended precision) can represent numbers which range in
- size between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to
- 1.19*10^4932 including denormal numbers) where the '^' denotes the
- power operator. For those familiar with floating point formats, this
- format has 64 mantissa bits, 15 exponent bits and 1 sign bit for
- the total of 80 bits. This format provides a precision of about
- 19 decimal places. The 80x87 can handle additional data types
- that are converted to/from the internal format upon being loaded/
- stored to/from the coprocessor. These include 16 bit, 32 bit, and
- 64 bit integers, an 18 digit BCD (binary coded decimal) number
- occupying 10 bytes, and two additional floating point types. The
- short real data type, also called single precision, has 32 bits
- that split into 23 mantissa bits, 8 exponent bits and a sign
- bit. This format provides a precision of about 6-7 decimal places
- and can represent numbers between 1.17*10^-38 and 3.40*10^38
- (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long
- real, or double precision, data type has 64 bits, consisting of
- 52 mantissa bits, 11 exponent bits and the sign bit. It provides
- 15-16 decimal digits of precision and can handle numbers from
- 2.22*10^-308 to 1.79*10^308 (4.94*10^-324 to 1.79*10^308 including
- denormal numbers).
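-
- The ranges quoted above follow directly from the bit counts of each
- format. As a minimal sketch (Python is used purely for illustration;
- the names FORMATS, sci and limits are hypothetical and not part of
- the test programs in the appendices), the limits can be recomputed
- with logarithms. The sketch counts the precision including the leading
- bit, which is implicit in the single and double formats (23+1 and
- 52+1 bits) but stored explicitly in the 80-bit format.

```python
import math

LOG10_2 = math.log10(2.0)

# (exponent bits, precision in bits including the leading bit)
FORMATS = {
    "single":   (8, 24),
    "double":   (11, 53),
    "extended": (15, 64),
}

def sci(log10_value):
    """Split 10**log10_value into (mantissa, decimal exponent)."""
    exponent = math.floor(log10_value)
    return 10.0 ** (log10_value - exponent), exponent

def limits(exponent_bits, precision_bits):
    """Return the magnitude limits of a binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    e_min, e_max = 1 - bias, bias
    # Largest finite value is (2 - 2**(1-p)) * 2**e_max.
    log_max = e_max * LOG10_2 + math.log10(2.0 - 2.0 ** (1 - precision_bits))
    return {
        "min_normal": sci(e_min * LOG10_2),
        "min_denormal": sci((e_min - (precision_bits - 1)) * LOG10_2),
        "max_finite": sci(log_max),
        "decimal_digits": precision_bits * LOG10_2,
    }

if __name__ == "__main__":
    for name, (e_bits, p_bits) in FORMATS.items():
        lim = limits(e_bits, p_bits)
        print(name, "about %.1f decimal digits" % lim["decimal_digits"])
        for key in ("min_denormal", "min_normal", "max_finite"):
            mantissa, exponent = lim[key]
            print("  %-12s %.2f*10^%d" % (key, mantissa, exponent))
```

- Logarithms are used because the extended-precision limits overflow
- an ordinary double. The printed mantissas should agree with the
- figures above except for rounding versus truncation of the last digit.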
-
- In addition to loading and storing the above mentioned operand types,
- the 80x87 coprocessors can perform all the basic arithmetic operations
- on floating point numbers. Besides 'knowing' how to add, subtract,
- multiply and divide they can also compare floating-point numbers,
- change the sign, take the square root or absolute value, compute
- the remainder and compute some of the transcendental functions,
- like the logarithm. The eight registers in the 80x87 are organized
- in a stack-like manner which takes some time getting used to if
- one programs the coprocessor directly in assembler. However,
- nowadays the compilers or interpreters for most high level
- languages (HLL) give the programmer access to the coprocessor's
- data types and use its instructions, so there is not much need
- to worry about the rather unusual architecture of the 80x87.
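-
- To give a feel for the stack-like register organization mentioned
- above, here is a toy model. It is only a sketch under simplifying
- assumptions: the class X87Stack and its method names are hypothetical,
- values are ordinary Python floats rather than 80-bit numbers, and only
- three instructions (FLD, FADDP, FMULP) are imitated.

```python
class X87Stack:
    """Toy model of the eight 80x87 registers addressed relative to a top pointer."""

    def __init__(self):
        self.regs = [None] * 8   # None marks an empty register
        self.top = 0             # index of the physical register that is ST(0)

    def _phys(self, i):
        # ST(i) is the physical register i places below the top of stack.
        return (self.top + i) % 8

    def st(self, i=0):
        value = self.regs[self._phys(i)]
        if value is None:
            raise IndexError("ST(%d) is empty" % i)
        return value

    def fld(self, value):
        """Push a value: the top pointer moves down, then the value is stored."""
        new_top = (self.top - 1) % 8
        if self.regs[new_top] is not None:
            raise OverflowError("register stack overflow")
        self.top = new_top
        self.regs[self.top] = float(value)

    def _pop(self):
        self.regs[self.top] = None
        self.top = (self.top + 1) % 8

    def faddp(self):
        """ST(1) := ST(1) + ST(0), then pop."""
        self.regs[self._phys(1)] = self.st(1) + self.st(0)
        self._pop()

    def fmulp(self):
        """ST(1) := ST(1) * ST(0), then pop."""
        self.regs[self._phys(1)] = self.st(1) * self.st(0)
        self._pop()


if __name__ == "__main__":
    fpu = X87Stack()
    # Evaluate (2 + 3) * 4 the way a stack machine would.
    fpu.fld(2)
    fpu.fld(3)
    fpu.faddp()
    fpu.fld(4)
    fpu.fmulp()
    print(fpu.st(0))
```

- Operands are always named relative to the current top of stack, which
- is why hand-written 80x87 assembler takes some getting used to.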
-
- Strictly speaking, the Weitek Abacus 3167 and 4167 are not
- coprocessors in that they do not transparently extend the
- CPU architecture. Rather they could be described as special
- memory mapped IO-devices. Since the term coprocessor has been
- traditionally used for these chips, they are also called by
- that term in this document. The architecture of the Weitek
- chips differs significantly from the 80x87. The Weitek's register
- file consists of 31 32-bit registers, each one capable of holding
- an IEEE single precision number. Pairs of consecutive single
- precision registers can also be used as 64-bit IEEE double
- precision registers. Thus there are 15 double precision registers.
- The Weitek register file has the standard organization known from
- other register files like those in the 80386, not the special
- stack-like organization of the 80x87 coprocessors. The Weitek
- coprocessors have been tuned for maximum performance. Therefore,
- only a small instruction set has been implemented, but each
- instruction executes at a very high speed, usually only a few
- clock cycles each. Instructions available are load/store, add,
- subtract, subtract reverse, multiply, multiply and negate,
- multiply and accumulate, multiply and take absolute value,
- divide reverse, negate, absolute value, compare/test, convert
- fix/float, and square root. Note that the Weitek Abacus does not
- support a double extended format, has no built-in transcendental
- functions, and does not support denormals. The resources required
- to implement such features have instead been devoted to making
- the basic arithmetic operations as fast as possible. While the
- 80x87 coprocessors perform all internal calculations in double
- extended precision and therefore have about the same performance
- for single and double precision calculations, the Weitek features
- explicitly single and double precision operations. For applications
- that require only single precision the Weitek provides additional
- performance that way, as single precision operations are about
- twice as fast as their double precision counterparts. Since the
- Weitek Abacus has more registers than the 80x87 coprocessors,
- values can be kept in registers more often and have to be loaded
- from memory less frequently. This also leads to a performance gain.
- To the CPU, the Weitek Abacus looks like a 64 kB block of memory
- starting at physical address 0C0000000h. Every address in this
- range corresponds to a coprocessor instruction. Accessing a
- specified memory location within this block with the 386/486's
- MOV instruction causes the corresponding instruction to be executed.
- The instructions have been assigned to memory locations in such a
- way that loads to consecutive coprocessor registers can make use
- of the 386/486 MOVS string instruction. The memory mapped interface
- of the Weitek coprocessors is much faster than the IO-oriented
- protocol that is used to couple the CPU to the 80287 and 80387
- coprocessors. The Weitek's starting address of 0C0000000h is only
- a physical address. The Weitek's memory block can be assigned to
- any logical address using the MMU (memory management unit) in the
- 386/486's protected and virtual modes. This also means that the
- Weitek Abacus 3167 and 4167 can *not* be used in the real mode
- of those processors, since the physical start address of the
- Weitek coprocessors is not within the 1 MByte address range and
- the MMU is inoperable in real mode. However, DOS programs can
- make use of the Weitek Abacus by using a DOS extender or a
- memory manager like EMM386 that runs in protected/virtual mode
- itself and can therefore map the Weitek's memory block to
- any desired location in the 1 MByte address range. Typically
- the FS segment register is then set up to point to the Weitek's
- memory block. The Weitek Abacus 3167 and 4167 are also supported
- by the UNIX operating system [33].
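-
- The real-mode restriction described above is a matter of simple
- address arithmetic. The following sketch assumes the usual real-mode
- address formation (segment * 16 + offset); the constant and function
- names are hypothetical and chosen for illustration only.

```python
WEITEK_BASE = 0xC0000000      # physical start address given in the text
WEITEK_SIZE = 64 * 1024       # 64 kB block of instruction addresses

def real_mode_address(segment, offset):
    """Physical address formed from a 16-bit segment and a 16-bit offset."""
    return (segment << 4) + offset

# Highest address any real-mode segment:offset pair can form.
REAL_MODE_MAX = real_mode_address(0xFFFF, 0xFFFF)

def reachable_in_real_mode(physical_address):
    return physical_address <= REAL_MODE_MAX

if __name__ == "__main__":
    print("highest real-mode address: %08Xh" % REAL_MODE_MAX)
    print("Weitek block: %08Xh-%08Xh" %
          (WEITEK_BASE, WEITEK_BASE + WEITEK_SIZE - 1))
    print("reachable in real mode:", reachable_in_real_mode(WEITEK_BASE))
```

- Since even the highest real-mode address lies far below the Weitek
- block, the block has to be remapped by software that controls the MMU.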
-
-
- What applications will profit by using a math coprocessor?
-
- According to the Intel 387DX User's Guide, there are more
- than 2100 commercial programs that can make use of a 387
- compatible coprocessor. Every program that uses floating
- point arithmetic somewhere and supports an 80x87 coprocessor
- can gain speed by installing a coprocessor. However, the
- speedup will vary from program to program and even within
- the same program depending on how computation intensive the
- program or operation within the program is. Typical applications
- that benefit from the use of an 80x87 coprocessor are:
- - Business graphics programs, such as Arts&Letters, Freedom
- of Press, and Freelance
- - Spreadsheet programs like Lotus 1-2-3, Excel, Quattro, and
- Wingz
- - CAD programs such as AutoCAD, VersaCAD, and GenericCAD
- - Database programs such as dBase IV, FoxBase, and Paradox
- - Math and Science programs such as Mathematica, TKSolver,
- SPSS/PC, and Statgraphics
- Note that for spreadsheets and databases, a coprocessor
- only helps if some kind of floating point computations
- is performed. This is true more often for spreadsheets
- than for databases. Also note that the speed of many
- programs depends quite heavily on the speed of the graphics
- adaptor (CAD) or the disk performance (databases), so the
- computational performance is only a (small) part of the
- total performance of the application. There are some programs
- that won't run without a coprocessor, among them AutoCAD R10
- and later and Mathematica. GUIs (graphical user interfaces)
- such as Windows do *not* gain additional speed from using a
- *mathematical* coprocessor, since their graphics operations
- only use integer arithmetic. They do benefit, though, from a
- graphics board with a graphics 'coprocessor' that speeds up
- certain common operations such as BitBlt or line drawing.
- However, applications running under Windows may take advantage
- of a math coprocessor, e.g. Excel.
-
- While support for 80x87 coprocessors is very common in application
- programs, the Weitek Abacus coprocessors do not enjoy such
- widespread support. Due to their high price, only a few high-end PCs
- have been equipped with Weitek coprocessors. Therefore most of
- the programs that support these coprocessors are also high-end
- products like AutoCAD and Versacad-386.
-
-
-
- Installing a math coprocessor
-
- Usually, installing a coprocessor doesn't pose much of a problem,
- as every coprocessor comes with installation instructions and a
- diagnostic disk that lets you check for correct operation once
- the coprocessor has been installed. In addition, the user manuals
- of most computers have a section on coprocessor installation.
-
- 1) Make sure to get the right coprocessor for your system. An
- 8087 works together with 8086, 8088, V20, and V30 CPUs. An
- 80287, 287XL or compatible works together with an 80286 CPU.
- There are also some old 386 motherboards that accept an 80287
- coprocessor, but they usually also provide a socket for the
- 387, and I recommend getting a 387 for use with these
- systems. An 80387, 387DX or compatible coprocessor is for 386
- based systems, as is the Intel RapidCAD. 387 coprocessors
- also work together with Cyrix' 486DLC CPU which despite its
- name does not include an FPU. Similarly, a 387SX or compatible
- coprocessor goes into systems whose CPU is a 386SX or Cyrix 486SLC.
- The Weitek Abacus 3167 works with a 386 CPU but requires a
- 121-pin EMC socket in your system. Some computers, such as
- IBM's PS/2s don't have this socket. The Weitek Abacus 4167
- works together with the 486 and requires the appropriate
- 142-pin socket to be present.
- Always install a coprocessor that is rated at the same speed
- as the CPU. That is, for a 40 MHz 386 system using AMD Am386-40,
- install a coprocessor rated for 40 MHz such as a Cyrix 83D87-40,
- IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its
- specified frequency rating may cause it to produce false results,
- which you might fail to recognize as such. I have personally
- experienced this problem with a Cyrix 83D87-33 that I tried
- to push to 40 MHz. It passed all the diagnostic benchmarks
- on the Cyrix diagnostic disk and the tests of some commercial
- system test programs. However, I found it to fail the
- Whetstone and Linpack benchmarks, which include accuracy
- checks. So although there is usually no problem with overheating
- when pushing a coprocessor over the specified maximum frequency
- rating, be warned that operation of a coprocessor above the
- maximum ratings stated by the manufacturer makes operation
- unreliable. Some 386 boards allow the coprocessor to be clocked
- differently than the CPU. This is called asynchronous operation
- and allows you to run the coprocessor at 33 MHz while the CPU
- runs at 40 MHz, for example. Please note that only the Intel
- 80387 and 387DX support asynchronous operation. The 387 'clones'
- from Cyrix, IIT and ULSI always run at the full speed of the
- CPU, even if you have set up your motherboard for asynchronous
- operation.
- 2) Once you've got the correct coprocessor for your system you
- can start the actual installation process:
- - turn off the computer's power switch and unplug the power
- cord from the wall outlet
- - remove the cover of your computer
- - locate the math coprocessor socket. This socket is located
- right next to the CPU, which can be identified by the
- printing on top of the chip. The CPU usually is one of the
- biggest chips on the board. The 8087 and 80287 DIL sockets
- are rectangular sockets with 20 pin holes on each of the
- longer sides. The 387SX PLCC socket is a square socket that
- has 17 vertical connector strips on the 'wall' of each side.
- The 387 PGA socket is square and has two rows of pin holes
- on each side. The EMC socket is similar but has three rows
- of holes on each side. The PGA socket for the Weitek 4167 is
- also square with three rows of holes on each side. If the CPU
- and coprocessor socket is on a separate card rather than on
- the motherboard (typical for modular systems), you have to
- remove the card and place it on a flat and hard surface free
- of static electricity. If you can't find the math coprocessor
- socket, consult your owner's manual or your computer dealer.
- If you want to install the Intel RapidCAD in a 386 system,
- you will have to remove the 386 CPU before starting to
- install the two RapidCAD chips. Intel provides an easy to
- use chip extractor and a storage box for the 386 chip for
- this purpose. Just follow the instructions in the RapidCAD's
- installation manual.
- - Be sure you are properly grounded before you remove the
- coprocessor from its antistatic box. Static electricity
- can damage the coprocessor. Make sure you do not touch
- the pins.
- - Check if all pins are straight and not bent. If you find
- bent pins, carefully straighten them with needle-nose pliers
- or tweezers.
- - Match the coprocessor's orientation with the orientation
- of the socket. 8087 and 287 coprocessors have a notch on one of
- the shorter sides of their rectangular DIL package that should
- be matched with the notch of the coprocessor socket. Usually
- the 286 CPU and the 287 coprocessor are placed alongside each
- other and both have the same orientation, that is their
- respective notches point in the same direction. 387SX
- coprocessors feature a white dot or similar mark that matches
- with some sort of marking on the socket. 387 coprocessors
- have a beveled corner that is also marked with a white dot
- or similar marking. This should be matched with the beveled
- or otherwise marked corner of the socket. If you install
- a 387 coprocessor in an EMC socket, leave one row of holes
- free on each side. Correct orientation of the coprocessor
- is absolutely essential, because if you insert it the
- wrong way it may be damaged. If you have found the correct
- orientation, make sure all pins are correctly aligned with
- their respective holes. Press firmly and evenly on the chip.
- You may have to press hard to seat the coprocessor all the
- way. Make sure your motherboard does not bend more than
- slightly under the insertion pressure. Otherwise it may
- develop cracks that could damage the signal lines on the
- board. For 8087, 287, and 387 coprocessors it is normal that
- the coprocessor does not go all the way in but about one
- millimeter (1/25 inch) of space is left between the socket
- and the bottom of the coprocessor chip. This enables the
- insertion of an extraction device should it become necessary
- to remove the coprocessor. Note that the construction of the
- 387SX's PLCC socket makes it next to impossible to remove
- the coprocessor once fully inserted, as the top of the chip
- is level with the socket's 'walls' then.
- 3) Check your computer's manual for the jumpers and/or switches
- you may have to set for coprocessor operation.
- Put the cover back on the system unit and reconnect the power.
- Turn on your computer. Depending on your BIOS, you may have
- to run the setup or configuration program to register the
- coprocessor.
- Use the diagnostic disk included with your coprocessor to
- check for correct operation of your coprocessor.
-
-
-
- Coprocessor emulations
-
- In the absence of a coprocessor, floating-point calculations
- are most often performed by a software package that simulates
- the operations of the coprocessor. Such a program is called
- a coprocessor emulator. Simulating the coprocessor has the
- advantage that identical code can be generated for the
- coprocessor and the emulator so that it is possible to write
- programs that run both on systems with and on systems without a
- coprocessor. Whether the program is to use the coprocessor or the
- emulator can then be decided at run-time by checking if a
- math coprocessor is present in the system.
-
- Two approaches to interface an 80x87 emulator to programs are
- common. While the first method works with all 80x86 processors,
- the second only works from the 80286 on. The first method makes
- use of the fact that all coprocessor instructions start with the
- same five bit pattern 11011. Thus the first byte of a coprocessor
- instruction will be in the range D8-DF hexadecimal. In addition,
- coprocessor instructions usually are preceded by a WAIT instruction
- (opcode 9Bh) which is one byte long (the reason for doing this
- is described in a later chapter on the operation of the 80x87).
- One common approach is to replace the WAIT instruction and the
- first byte of the coprocessor instruction with one of eight
- interrupts; the remaining bytes of the coprocessor instruction
- are left unchanged. Interrupts 34 to 3B hexadecimal are used for
- this emulation technique. Note that the sequences 9B D8 .. 9B DF
- can be easily converted to the interrupt instructions CD 34 .. CD 3B
- by simple addition and subtraction of constants. The compiler or
- assembler produces code that contains the appropriate interrupt
- calls instead of the coprocessor instructions. If a coprocessor
- is detected at run-time, the emulator interrupts point to a short
- routine that converts the interrupt calls back to coprocessor
- instructions (self modifying code). If no coprocessor is found
- the interrupts point to an emulation package which examines the
- byte(s) following the interrupt instruction to determine what
- operation to perform. The method described is used by the compilers
- from Microsoft and Borland for example. It works with every
- 80x86 CPU from the 8086/8088 on.
- The second method to interface an emulator is only available on
- 286 and 386 machines. If the emulation bit in the machine status
- word of these processors is set, the processors will generate an
- interrupt 7 whenever a coprocessor instruction is encountered.
- The interrupt vector then points to an emulation package that
- decodes the instruction and performs the desired operation. This
- approach has the advantage that the emulator doesn't have to be
- included in the program code, but can be loaded as a TSR or
- device driver once and then used by every program that requires
- a coprocessor. Emulation via interrupt 7 is transparent, which
- means that programs containing coprocessor instructions execute
- just as if a coprocessor were present, only slower. This approach
- is taken by the public domain EM87 emulator and the commercial
- Franke387 emulator, for example. Even programs that require a
- coprocessor to run, like AutoCAD, are 'fooled' into believing that
- a coprocessor is present with emulators using INT 7.
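-
- The byte arithmetic behind the first interfacing method (WAIT plus
- escape byte replaced by an interrupt call) can be sketched as follows.
- This is an illustration only, not the code of any actual compiler or
- emulator; the function names are hypothetical. The constants 32h and
- 0A4h are simply the differences CDh - 9Bh and D8h - 34h.

```python
WAIT_OPCODE = 0x9B   # one-byte WAIT instruction
INT_OPCODE = 0xCD    # first byte of the two-byte INT n instruction

def to_emulator_call(code):
    """Rewrite 9B D8..DF <rest> as CD 34..3B <rest>."""
    if len(code) < 2 or code[0] != WAIT_OPCODE or (code[1] >> 3) != 0b11011:
        raise ValueError("not a WAIT-prefixed coprocessor instruction")
    return bytes([code[0] + 0x32, code[1] - 0xA4]) + bytes(code[2:])

def to_coprocessor_instruction(code):
    """Inverse patch: rewrite CD 34..3B <rest> as 9B D8..DF <rest>."""
    if len(code) < 2 or code[0] != INT_OPCODE or not 0x34 <= code[1] <= 0x3B:
        raise ValueError("not an emulator interrupt call")
    return bytes([code[0] - 0x32, code[1] + 0xA4]) + bytes(code[2:])

if __name__ == "__main__":
    fadd = bytes([0x9B, 0xD8, 0xC0])      # WAIT, then FADD ST, ST(0)
    patched = to_emulator_call(fadd)
    print(patched.hex(" "))               # cd 34 c0
    print(to_coprocessor_instruction(patched).hex(" "))   # 9b d8 c0
```

- Only the first two bytes change; the remaining bytes of the instruction
- stay in place so that the emulator can decode them.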
-
- The size of the emulator used by TP 6.0 is about 9.5 kB, EM87
- occupies about 15.8 kB as a TSR, and Franke387 uses about 13.4 kByte
- as a device driver. Note that Franke387 and especially EM87 model
- a real coprocessor much more closely than Turbo Pascal's emulator
- does. In particular, EM87 supports denormal numbers, precision
- control, and rounding control. The emulator in TP 6.0 does not
- implement these features. The version of Franke387 tested (V2.4)
- supports denormals in single and double precision, but not
- double extended precision. It supports precision control, but
- not rounding control. Intel's E80287 is supposed to be a 100%
- exact emulation of the 80287 coprocessor [44]. Generally, the
- more closely a real coprocessor is modelled by the emulator,
- the slower the emulator runs and the larger the code for the
- emulator is.
-
-
- Relative execution times of coprocessor vs. software emulators
- for particular coprocessor instructions
-
-                     Intel 387DX    TP 6.0 Emulator    EM87 Emulator
-
- FADD ST, ST(0)           1               26                104
- FDIV [DWord]             1               22                136
- FXAM                     1               10                 73
- FYL2X                    1               33                102
- FPATAN                   1               36                110
- F2XM1                    1               38                110
-
-
- The following table is an excerpt from [44]:
-
-                     Intel 80287    Intel E80287 Emulator
-
- FADD ST, ST(0)           1                  42
- FDIV [DWord]             1                 266
- FXAM                     1                 139
- FYL2X                    1                  99
- FPATAN                   1                 153
- F2XM1                    1                  41
-
-
-
- The following has been adapted from [43] and merged with my own
- data:
-
-                     Intel 8087    TP 6.0 Emul. (8086)    Intel Emul. (8086)
-
- FADD ST, ST(0)          1                20                    94
- FDIV [DWord]            1                22                    82
- FPTAN                   1                18                   144
- F2XM1                   1                 6                   171
- FSQRT                   1                44                   544
-
-
-
- One of the reasons emulators are so slow is that they are
- often designed to run with every CPU from the 8086/8088 on.
- This is the case with the emulators built into the compiler
- libraries of the Turbo Pascal 6.0 (also used by Turbo C/C++)
- and Microsoft C 6.0 compiler (probably also used in other
- Microsoft products) and is also true for the EM87 emulator
- in the public domain. By using code that can run on an 8086/8088,
- these emulators forego the speed advantage offered by the
- additional instructions and architectural enhancements (such
- as 32-bit registers) of the more advanced Intel 80x86 processors.
- A notable exception is the Franke387 emulator, a commercial
- emulator that is also sold as shareware. It uses 386 specific
- 32-bit code and only runs on 386/386SX computers.
-
- Besides being slow, coprocessor emulators have other drawbacks
- compared with real coprocessors. Most of the emulators do not
- support the additional instructions that the 387 compatible
- coprocessors offer over the 80287. Often, some of the low-level
- stack-manipulating instructions like FDECSTP are not emulated.
- The coprocessor status register is not or only partially emulated.
- Some emulators do not conform to the IEEE-754 standard in their
- implementation of the basic arithmetic functions, while the
- coprocessors do. Also, they sometimes lack the support for
- denormals (a special class of floating point numbers) although
- it is required by the standard. Not all the 80x87 emulators
- support rounding control (a feature required by IEEE-754) and
- precision control (a feature of the 80x87 coprocessor). Most of
- the omissions are aimed at making the emulator faster and smaller.
- Because of the shortcomings of coprocessor emulators, a real
- coprocessor is a must for anybody planning to do some serious
- computations. At today's prices, this shouldn't pose much of a
- problem to anybody.
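-
- The rounding control and precision control mentioned above are two
- bit fields in the coprocessor's 16-bit control word. The sketch below
- only manipulates the numeric value of such a control word; it assumes
- the usual 80x87 layout (precision control in bits 8-9, rounding
- control in bits 10-11) and 037Fh as the control word a 387 holds after
- FINIT. The Python names are hypothetical. Actually loading the value
- into a coprocessor requires the FLDCW instruction, which is not shown.

```python
DEFAULT_CW = 0x037F   # assumed value after FINIT on a 387

PC_MASK = 0x0300      # precision control field, bits 8-9
PRECISION = {"single": 0x0000, "double": 0x0200, "extended": 0x0300}

RC_MASK = 0x0C00      # rounding control field, bits 10-11
ROUNDING = {"nearest": 0x0000, "down": 0x0400, "up": 0x0800, "chop": 0x0C00}

def set_precision(control_word, mode):
    """Return the control word with its precision field replaced."""
    return (control_word & ~PC_MASK & 0xFFFF) | PRECISION[mode]

def set_rounding(control_word, mode):
    """Return the control word with its rounding field replaced."""
    return (control_word & ~RC_MASK & 0xFFFF) | ROUNDING[mode]

def describe(control_word):
    """Decode the two fields back into their names."""
    precision = {v: k for k, v in PRECISION.items()}.get(
        control_word & PC_MASK, "reserved")
    rounding = {v: k for k, v in ROUNDING.items()}[control_word & RC_MASK]
    return precision, rounding

if __name__ == "__main__":
    cw = set_precision(DEFAULT_CW, "single")
    print("%04Xh" % cw, describe(cw))
    cw = set_rounding(DEFAULT_CW, "chop")
    print("%04Xh" % cw, describe(cw))
```

- An emulator that ignores these fields always computes as if the
- default setting (extended precision, round to nearest) were in effect.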
-
-
- Available coprocessors, CPU+FPU as of 08-10-92:
-
-
- Intel 8087 [43] was the first coprocessor that Intel brought
- out for the 80x86 family. It was introduced in 1980
- and therefore does not have full compatibility with
- the IEEE-754 standard for floating point arithmetic,
- which was finally released in 1985. It complements
- the 8088 and 8086 CPUs and can also be interfaced
- to the 80188 and 80186 processors. It comes in a
- 40 pin CERDIP (ceramic dual inline package). It
- is available in 5 MHz, 8 MHz (8087-2), and 10 MHz
- (8087-1) versions. The 8087 is implemented using
- NMOS. Power consumption is rated at max. 2400 mW [42].
- A neat trick to enhance the processing power of the
- 8087 for computations that use only the basic
- arithmetic operations (+,-,*,/) and do not require
- high precision is to set the precision control to
- single precision. This gives one a performance
- increase of up to 20%. For details about programming
- the precision control, see program PCtrl in appendix A.
- Intel 80187 is a rather new coprocessor designed to support the
- 80C186 embedded controller. It was introduced in 1989
- and implements the complete 80387 instruction set.
- It is available in a 40 pin CERDIP (ceramic dual
- inline package) and a 44 pin PLCC (plastic leaded
- chip carrier) for 12.5 and 16 MHz operation. Power
- consumption is rated at max. 675 mW for the
- 12.5 MHz version and max. 780 mW for the 16 MHz
- version [37].
- Intel 80287 [44] is the original Intel coprocessor for the 80286
- and was introduced in 1983. It uses the same execution
- unit as the 8087 and therefore has the same speed
- (sometimes slower due to additional overhead in CPU
- coprocessor communication). Like the 8087, it does not
- provide full compatibility with the IEEE-754 floating
- point standard released in 1985. It was manufactured
- in NMOS technology. There are 6 MHz, 8 MHz, and 10
- MHz versions. The chip comes in a 40 pin CERDIP
- (ceramic dual inline package). Power consumption can
- be estimated to be the same as that for the 8087,
- which is max. 2400 mW. The 80287 has been replaced
- in the Intel 80x87 family with its successor, the
- Intel 287XL, which was introduced in 1990. The
- 287XL is done in CMOS. It is based on the 387 core
- and therefore much faster than the 80287. There may
- still be a few of the old 80287 chips on the market
- though.
- Intel 80287XL is the second generation 287 introduced by Intel
- in 1990. Since it is based on the 387 core, it
- features full IEEE 754 compatibility and faster
- execution of coprocessor instructions. Intel claims
- about 50% faster operation than the 80287 for typical
- benchmark tests such as Whetstone [45]. Comparison
- with benchmark results for the AMD 80C287, which is
- identical to the Intel 80287, supports this claim [1].
- The Intel 287XL performed 66% faster than the AMD
- 80C287 on the fractal benchmark and 66% faster on
- the Whetstone benchmark in these tests. Whetstone
- results from [46] show the Intel 287XL at 12.5 MHz
- to perform 552 kWhets/sec as opposed to the AMD
- 80C287's 289 kWhets/sec, a 91% performance increase.
- A benchmark using the MathPak program showed the
- Intel 287XL to be 59% faster than the Intel 80287
- (6.9 sec. vs. 11.0 sec.) [26]. Since the 287XL
- has all the additional instructions and enhancements
- of a 387, most software automatically identifies
- it as an 80387 compatible coprocessor and makes
- use of the extra features available like the FSIN
- and FCOS instructions. The 287XL is done in CMOS
- and therefore uses less power than the older 80287,
- which was done in NMOS. The 287XL is rated for
- speeds of up to 12.5 MHz. At 12.5 MHz, the power
- consumption is rated at max. 675 mW, about 1/4 of
- the 80287 power consumption. The 287XL comes in
- either a 40 pin CERDIP (ceramic dual inline package)
- or a 44 pin PLCC (plastic leaded chip carrier). The
- latter version is called the 287XLT and intended
- mainly for laptop use.
- AMD 80C287 is an exact clone of the old Intel 80287 that was
- brought to market by AMD in 1989. It contains the
- original microcode of the 80287 and is therefore
- 100% compatible with this chip. However, as the name
- indicates, the 80C287 is manufactured in CMOS and
- therefore uses less power than an equivalent Intel
- 80287. At 12.5 MHz, its power consumption is rated
- at max. 625 mW or slightly less than that of the
- Intel 80287XL [27]. There is also another version
- called AMD 80EC287 that uses an 'intelligent' power
- save feature to reduce the power consumption below
- 80C287 levels. Tests at 10.7 MHz show typical power
- consumption for the 80EC287 to be at 30mW, compared
- to 150 mW for the AMD 80C287, 300 mW for the Intel
- 287XL and 1500 mW for the Intel 80287 [57]. The
- 80EC287 is therefore ideally suited for low power
- laptop systems. The AMD 80C287 is available in speeds
- of 10, 12, and 16 MHz. I have only seen it being
- offered in 10 MHz and 12 MHz versions though. At
- about US$ 50, it is the cheapest coprocessor available.
- Note that it provides less performance than the
- newer Intel 287XL (see above for details). The AMD
- 80C287 is available in 40 pin ceramic and plastic
- DIPs (dual inline package) and as 44 pin PLCC
- (plastic leaded chip carrier). Due to recent legal
- battles with Intel over the right to use the 287
- microcode, which AMD lost, AMD may have to discontinue
- this product (disclaimer: I am not a legal expert).
- Cyrix 82S87 was developed from the Cyrix 83D87, Cyrix' 387 'clone'
- and has been available since 1991. It implements the
- full 387 instruction set. It totally complies with
- the IEEE-754 standard for floating point arithmetic
- and features nearly total compatibility with Intel's
- coprocessors. It implements the transcendental
- functions with the same degree of accuracy and the
- superior speed of the Cyrix 83D87. This makes the
- Cyrix 82S87 the fastest [1] and most accurate 287
- compatible coprocessor available. Documentation by
- Cyrix [46] rates the 82S87 at 730 kWhets/sec for a
- 12.5 MHz system, while the Intel 287XL performs only
- 552 kWhets/sec. The 82S87 is a fully static CMOS
- design with very low power requirements that can
- run at speeds of 6 to 20 MHz. Cyrix documentation
- shows the 82S87 to consume about the same amount of
- power as the AMD 80C287 (see above). The 82S87 comes
- in a 40 pin DIP or a 44 pin PLCC (plastic leaded
- chip carrier) compatible with the pinout of the
- Intel 287XLT and ideally suited for laptop use.
- IIT 2C87 was the first 287 clone available. It was introduced
- to the market in 1989. It has about the same speed
- as the Intel 287XL [1]. The 2C87 implements the
- full 387 instruction set [38]. Tests I ran on the
- 3C87 seem to indicate that it is not fully compatible
- with the IEEE-754 standard for floating-point
- arithmetic (see below for details), so it can be
- assumed that the 2C87 also fails these tests as it
- presumably uses the same core as the 3C87. The IIT
- 2C87 provides extra functions not available on any
- other 287 chip [38]. It has 24 user accessible
- floating-point registers organized into three register
- banks. Additional instructions (FSBP0, FSBP1, FSBP2)
- allow switching from one bank to another. Transfers
- between registers in different banks are not
- supported however, so this feature by itself
- is of limited usefulness. Also there seems to
- be only one status register (containing the
- stack top pointer), so it has to be manually
- loaded and stored when switching between banks
- with a different number of registers in use [40].
- The register banks' main purpose is to aid the
- fourth additional instruction the 2C87 has
- (F4X4), which does a full multiply of a 4x4 matrix
- by a 4x1 vector, an operation common in 3D graphics
- applications [39]. The built-in matrix multiply
- speeds this operation up by a factor of 6 to 8
- compared with a programmed solution according to
- the manufacturer [38]. Tests show the speed-up
- to be indeed in this range [40]. For the 3C87, I
- measured the execution time of F4X4 to be about
- 280 clock cycles; the execution time on the 2C87
- should be somewhat longer. I estimate it to be
- around 310 clock cycles due to the higher CPU-NDP
- communication overhead in instruction execution in
- 286/287 systems (~45-50 clock cycles) compared with
- 386/387 systems (~16-20 clock cycles). As useful as
- the F4X4 instruction may seem, there are only very
- few applications that make use of this feature if
- an IIT coprocessor is detected at run time, among
- them Schroff Development's Silver Screen and
- Evolution Computing's Fast-CAD 3-D [25]. The 2C87
- is available for speeds of up to 20 MHz. It is
- implemented in an advanced CMOS process and has
- therefore a low power consumption of typically
- about 500 mW [38].
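-
- The estimate of around 310 clock cycles for F4X4 on the 2C87
- can be reproduced with a little arithmetic. The following is
- only a sketch: it assumes that the 280 cycles measured on the
- 3C87 already include the 386/387 communication overhead, that
- the 2C87 differs only in that overhead, and that the midpoints
- of the quoted ranges are representative. The function name is
- made up for this illustration.
-
```python
def estimate_f4x4_cycles_2c87(cycles_3c87=280,
                              overhead_386=(16, 20),
                              overhead_286=(45, 50)):
    """Scale the F4X4 time measured on the 3C87 to a 286/287 system.

    Assumption: only the CPU-NDP communication overhead differs,
    and the midpoint of each quoted range is used.
    """
    mid_386 = sum(overhead_386) / 2   # midpoint of the 386/387 overhead range
    mid_286 = sum(overhead_286) / 2   # midpoint of the 286/287 overhead range
    # Remove the 386/387 overhead, then add the 286/287 overhead.
    return cycles_3c87 - mid_386 + mid_286


if __name__ == "__main__":
    print(estimate_f4x4_cycles_2c87())
```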
- Intel 387 was the first generation of coprocessors for the
- Intel 386. It was introduced in 1986, about one
- year after the introduction of the 80386. Early 386
- systems were therefore equipped with both an 80287
- and an 80387 socket. The 80386 works together with
- the 80287, but the numerical performance is hardly
- adequate for such a system. The 80387 has since
- been superseded by the Intel 387DX, which was
- quietly introduced in 1990. You might find an 80387
- when acquiring an old 386 machine, though. The
- 80387 is about 20% slower than the newer 387DX
- (see the paragraph below for detailed information).
- Like the other 387 coprocessors, the 80387 is packaged
- in a 68-pin ceramic PGA. The Intel 80387 is
- manufactured using Intel's older 1.5 micron CHMOS
- III technology that has moderate power requirements.
- Power consumption at 16 MHz is max. 1250 mW (750 mW
- typical), at 20 MHz it is max. 1550 mW (950 mW
- typical), and at 25 MHz it is max. 1950 mW (1250 mW
- typical) [60].
- Intel 387DX is the second generation Intel 387 that was quietly
- introduced in 1989. This version is done in a more
- advanced CMOS process than the 80387 that enables
- the coprocessor to run at a maximum frequency of 33
- MHz, while the 80387 had a maximum frequency of 25 MHz.
- The 387DX is about 20% faster than the 80387 on the
- average for the same clock frequency. For a 386/387
- system operating at 29 MHz the Whetstone benchmark
- compiled with the highly optimizing Metaware High-C
- V1.6 runs at 2377 kWhetstones/sec for the 80387 and
- at 2693 kWhetstones/sec for the 387DX, a 13% increase.
- In a fractal calculation programmed in assembly
- language, the 387DX performance was 28% higher than
- the performance of the 80387. The transcendental
- functions have also been sped up from the 80387 to the
- 387DX. In the Savage benchmark compiled with the
- Metaware High-C V1.6 optimizing compiler and running
- on a 29 MHz system, the 80387 evaluated 77600 function
- calls/second, while the 387DX evaluated 97800 function
- calls/second, a 26% increase [7]. Some instructions
- have been sped up a lot more than the average
- 20%. For example the FBSTP instruction has been sped
- up by a factor of 3.64. The Intel 387DX (and its
- predecessor 80387) are the only 387 coprocessors
- that support asynchronous operation of CPU and NDP.
- The 387 consists of a bus interface unit and a
- numerical execution unit. The bus interface unit
- always runs at the speed of the CPU clock (CPUCLK2).
- If the CKM (ClocK Mode) pin of the 387 is strapped
- to Vcc, the numerical execution unit runs at the
- same speed as the bus interface unit. If CKM is tied
- to ground, the numerical execution unit runs at the
- speed provided by the NUMCLK2 input. The ratio of
- NUMCLK2 (coprocessor clock) to CPUCLK2 (CPU clock)
- must lie within the range 10:16 to 14:10. For example,
- for a 20 MHz 386, the Intel 387DX could be clocked
- from 12.5 MHz to 28 MHz via the NUMCLK2 input. On
- the Cyrix 83D87, Cyrix 387+, ULSI 83C87, and the IIT
- 387, the CKM pin is not connected. These coprocessors
- always run at the speed of the CPU. The Intel 387DX
- is manufactured using Intel's advanced low power
- CHMOS IV technology. Power consumption at 20 MHz is
- max. 900 mW (525 mW typical), at 25 MHz it is max.
- 1050 mW (625 mW typical), and at 33 MHz it is max.
- 1250 mW (750 mW typical) [59].
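-
- The permitted clock range for asynchronous operation follows
- directly from the 10:16 to 14:10 ratio limits quoted above. The
- short sketch below reproduces the 20 MHz example; the function
- and constant names are made up for this illustration, and the
- rated maximum frequency of the particular coprocessor still
- applies in addition to the ratio limits.
-
```python
from fractions import Fraction

MIN_RATIO = Fraction(10, 16)   # lowest allowed NUMCLK2 : CPUCLK2 ratio
MAX_RATIO = Fraction(14, 10)   # highest allowed NUMCLK2 : CPUCLK2 ratio


def numclk_range_mhz(cpu_mhz):
    """Return (lowest, highest) coprocessor clock for a given CPU clock."""
    return float(cpu_mhz * MIN_RATIO), float(cpu_mhz * MAX_RATIO)


def ratio_allowed(cpu_mhz, ndp_mhz):
    """Check whether a CPU/coprocessor clock pair is within the limits."""
    ratio = Fraction(ndp_mhz) / Fraction(cpu_mhz)
    return MIN_RATIO <= ratio <= MAX_RATIO


if __name__ == "__main__":
    print(numclk_range_mhz(20))    # the 20 MHz 386 example from the text
    print(ratio_allowed(20, 25))   # a 25 MHz coprocessor clock on a 20 MHz CPU
    print(ratio_allowed(20, 33))   # 33 MHz would exceed the 14:10 limit
```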
- Intel 387SX is the coprocessor for the Intel 386SX. The 386SX is
- an Intel 386 with a 16-bit data path. This reduces
- somewhat the costs to build a complete system as
- compared to a full 32-bit design required by the
- 80386DX. The 386SX main purpose was to replace the
- 80286 CPU, which Intel subsequently stopped producing.
- Due to the 16-bit data path, the 386SX is slower than
- the 386DX and offers about the same speed as an 80286
- at the same clock frequency for 16-bit applications.
- As the 386SX is a complete 80386, it offers also the
- possibility to run 32-bit applications and supports
- the virtual 8086 mode used for example by Windows'
- enhanced mode. The 387SX has all the features the
- Intel 387DX offers, including the ability for
- asynchronous operation of CPU and coprocessor
- (see the above paragraph on the Intel 387DX for
- details). Due to the 16 bit data path between the
- CPU and the coprocessor, the 387SX is a bit slower
- than a 387DX operating at the same frequency. The
- 387SX comes in a 68-pin PLCC (plastic leaded chip
- carrier) package and is available in 16 MHz and 20
- MHz versions. Coprocessors for faster 386SX systems
- based on the Am386SX CPU are available from IIT,
- Cyrix, and ULSI. Power consumption for the 387SX
- at 16 MHz is max. 1250 mW (740 mW typical), for
- the 20 MHz version it is max. 1500 mW (1000 mW
- typical) [62].
- IIT 3C87 came out in 1989 at about the same time as the
- Cyrix 83D87. Both coprocessors are faster than
- Intel's 387DX coprocessor. Tests I ran with the
- IEEETEST program show that the 3C87 is not fully
- compatible with the IEEE-754 standard for
- floating-point arithmetic although the manufacturer
- claims otherwise. It is quite possible that the
- reported errors are due to personal interpretations
- of the standard by the program's author that have
- been incorporated into IEEETEST and that the
- standard also supports the different interpretation
- chosen by IIT. On the other hand, the IEEE test
- vectors incorporated into IEEETEST have become
- somewhat of an industry standard [66] and Intel's
- 387, 486, and RapidCAD chips pass the test without
- a single failure, so the fact that the IIT 3C87
- fails some of the tests indicates that it is not
- fully compatible with the Intel 387 coprocessor.
- My tests also show that the IIT 3C87 does not
- support denormals for the double extended format.
- It is not entirely clear whether the IEEE standard
- mandates support for extended precision denormals,
- as the IEEE-754 document explicitly only mentions
- single and double precision denormals. Missing
- support for denormals is not a critical issue with
- most applications but there are some programs for
- which support of denormals is quite helpful, if not
- important [41]. Anyhow, failure of the 3C87 to
- support extended precision denormal numbers is an
- incompatibility with the Intel 387 and 486. The 3C87
- provides extra functions not available on any other
- 387 chip [38]. It has 24 user accessible floating-point
- registers organized into three register banks.
- Additional instructions (FSBP0, FSBP1, FSBP2)
- allow switching from one bank to another. Transfers
- between registers in different banks are not
- supported however, so this feature by itself
- is of limited usefulness. Also there seems to
- be only one status register (containing the
- stack top pointer), so it has to be manually
- loaded and stored when switching between banks
- with a different number of registers in use [40].
- The register banks' main purpose is to aid the
- fourth additional instruction the 3C87 has
- (F4X4), which does a full multiply of a 4x4 matrix
- by a 4x1 vector, an operation common in 3D graphics
- applications [39]. I measured this instruction to
- execute in about 280 clock cycles, during which
- time it executes 16 multiplications and 12 additions.
- The built-in matrix multiply speeds the matrix by
- vector multiply up by a factor of 3 compared
- with a programmed solution according to IIT [39].
- The results for my own TRNSFORM benchmark support
- this claim (see results below), showing a performance
- increase by a factor of about 2.5. This makes
- matrix multiplies on the IIT 3C87 nearly as fast as
- on an Intel 486 at the same clock frequency. However,
- there are only very few applications that make use
- of this feature if an IIT 3C87 is detected at run time,
- among them Schroff Development's Silver Screen and
- Evolution Computing's Fast-CAD 3-D [25]. Like the
- 387 'clones' from Cyrix and ULSI, the 3C87 does not
- support asynchronous operation of the CPU and the
- coprocessor. The 3C87 always runs at the full speed
- of the CPU. The 3C87 is implemented in an advanced
- CMOS process and has low power requirements of
- typically about 600 mW. It is available in 16, 20,
- 25, 33, and 40 MHz versions.
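-
- The operation performed by F4X4 can be written out in a few
- lines to show where the 16 multiplications and 12 additions
- come from. This is a plain software sketch of a 4x4 matrix by
- 4x1 vector product, not a description of how the IIT chip
- implements it internally; the function name and the example
- transform are made up for this illustration.
-
```python
def mat4_times_vec4(m, v):
    """Multiply a 4x4 matrix by a 4x1 vector and count the operations."""
    mults = adds = 0
    result = []
    for row in m:
        acc = row[0] * v[0]          # first product of the row
        mults += 1
        for a, b in zip(row[1:], v[1:]):
            acc += a * b             # one multiply and one add per term
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds


if __name__ == "__main__":
    # Translate the point (1, 2, 3) by (10, 20, 30), homogeneous coordinates.
    translate = [[1, 0, 0, 10],
                 [0, 1, 0, 20],
                 [0, 0, 1, 30],
                 [0, 0, 0, 1]]
    point = [1, 2, 3, 1]
    print(mat4_times_vec4(translate, point))
```
-
- With 28 arithmetic operations in about 280 clock cycles, F4X4
- averages roughly 10 clock cycles per operation.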
- IIT 3C87SX is the version of the IIT 3C87 that is intended for
- use with Intel's 386SX or AMD's Am386SX CPU. It is
- functionally equivalent to the IIT 3C87. Due to the
- 16-bit data path between the CPU and the coprocessor
- in a 386SX based system, coprocessor instructions
- will execute somewhat slower than on the 3C87. The
- IIT 3C87SX is the only 387SX coprocessor that is
- offered at speeds of 16, 20, 25, and 33 MHz right
- now. I have read that Cyrix has also announced an
- 83S87-33, but haven't seen it being offered yet.
- The 3C87SX is packaged in a 68-pin PLCC.
- Cyrix 83D87 was introduced in 1989, only shortly after the
- coprocessors from IIT. It has been the fastest
- 387 compatible coprocessor in several benchmark
- comparisons [1,7,68,69]. It also came out as the
- fastest coprocessor in my own tests (see benchmark
- results below). Although the Cyrix 83D87 provides
- up to 50% more performance than the Intel 387DX
- in benchmark comparisons, the speed advantage
- over other 387 compatible coprocessors in real
- applications is usually much smaller. For example,
- in a test using the program 3D-Studio, the Cyrix
- 83D87 was 6% faster than the Intel 387DX [1].
- Besides being the fastest 387 coprocessor, the
- 83D87 also offers the most accurate transcendental
- functions results of all coprocessors tested (see
- test results below). Unlike the Intel coprocessors,
- which use the CORDIC [18,19] algorithm to compute
- the transcendental functions, Cyrix uses rational
- approximations to the functions. In the past the
- CORDIC method has been popular since it requires
- only shifts and adds which makes it easy to implement.
- It is also reasonably fast. Recently, the cost for
- the implementation for fast floating-point multipliers
- has dropped significantly due to the availability of
- VLSI, making the use of rational approximations
- superior to CORDIC for the generation of transcendental
- functions [61]. The Cyrix 83D87 uses a very fast
- array multiplier, making its transcendental functions
- faster than those of any other 387 compatible
- coprocessor. It also uses 75 bits for the mantissa
- for intermediate calculations (as opposed to 68 bits
- on other coprocessors), making its transcendental
- functions more accurate than those of any other
- coprocessor or FPU (see results below). The 83D87
- and its successor, the 387+ are the 387 'clones'
- with the highest degree of compatibility. There
- are only very few SW and HW incompatibilities with
- the Intel 387DX. These have been documented by
- Cyrix [12]. The software differences are caused
- by some bugs present in the 387DX that Cyrix fixed
- for the 83D87. Unlike the Intel 387DX, the 83D87
- (and all other 387 'clones' as well) does not support
- asynchronous operation of CPU and coprocessor. There
- have also been problems in the past with the CPU -
- coprocessor communication, causing the 83D87 to
- hang on some machines. The reason was that Cyrix
- shaved off a wait state in the communication protocol,
- which caused a communications breakdown between the
- CPU and the 83D87 for some systems running at 25 MHz
- or faster. One notable example of this behavior was
- the Intel 302 board. The problem is only rarely
- encountered with the current generation of 386
- motherboards. It is possible that the problem has
- been entirely eliminated in the 387+, the successor
- to the 83D87. To reduce power consumption the 83D87
- features advanced power saving features. Those
- portions of the coprocessor that are not needed
- are automatically shut down. If no coprocessor
- instructions are being executed, all parts except
- the bus interface unit are shut down [12]. Maximal
- power consumption of the Cyrix 83D87 at 33 MHz is
- 1900 mW, typical power consumption at this clock
- frequency is 500 mW [15].
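-
- To illustrate why CORDIC needs only shifts and adds, here is
- a textbook rotation-mode CORDIC for sine and cosine. It is a
- sketch only and not the microcode of any of the coprocessors
- discussed here: it uses ordinary floating-point numbers, with
- the multiplication by 2**-i standing in for the right shift a
- fixed-point hardware implementation would use, and it only
- converges for angles up to about 1.74 radians in magnitude.
-
```python
import math


def cordic_sin_cos(angle, iterations=32):
    """Return (sin, cos) of angle (radians, |angle| <= ~1.74) via CORDIC."""
    # Table of elementary rotation angles arctan(2^-i); a hardware
    # implementation would keep these constants in ROM.
    atans = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector; precompute the total gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0     # rotate towards z == 0
        shift = 2.0 ** -i                 # a right shift by i bits in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * atans[i]
    return y, x


if __name__ == "__main__":
    s, c = cordic_sin_cos(0.5)
    print(s, math.sin(0.5))
    print(c, math.cos(0.5))
```
-
- Every iteration contributes roughly one more bit of accuracy,
- which is why the method is simple but needs many steps for a
- 64-bit mantissa.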
-
- Cyrix EMC87 is basically a special version of the Cyrix 83D87.
- In addition to the normal 387 operating mode, in
- which coprocessor-CPU communication is handled through
- reserved IO-ports, it also offers a memory-mapped
- mode of operation similar to the operation principle
- of the Weitek Abacus. Please note that the EMC87 is
- *not* compatible with Weitek's Abacus coprocessor.
- They both use the same interface technique (memory
- mapping) but while the EMC87 uses the standard 387
- instruction set, the Weitek coprocessors use a
- different instruction set of their own. Like the
- Weitek Abacus, the EMC87 occupies a 64 kByte memory
- block starting at physical address C0000000h. It can
- therefore only be accessed in the protected or virtual
- modes of the 386 CPU. DOS programs can access the
- EMC87 with the help of DOS-extenders or memory
- managers like EMM386 which run in protected/virtual
- mode themselves. Since the EMC87 also provides the
- standard CPU interface via IO-ports, it can be used
- just like any other 387 compatible coprocessor and
- delivers the same performance as the Cyrix 83D87 in
- this mode. However, using the memory mapped mode of
- the EMC87 provides a significant speed advantage.
- The traditional 387 CPU-coprocessor interface via
- IO-ports has an overhead of about 16-20 clock cycles.
- Since the Cyrix 83D87 executes some operations like
- addition and multiplication in much less time, its
- performance is limited by the CPU-coprocessor
- interface. The memory-mapped mode has much less
- overhead and allows all coprocessor instructions to
- be executed at full speed and with no penalty. For
- this reason, Cyrix introduced the EMC87 in 1990.
- In a test, the EMC87 at 33 MHz ran the single
- precision Whetstone benchmark at 7608 kWhetstones/sec,
- while the Cyrix 83D87 at 33 MHz had a speed of
- only 5049 kWhetstones/sec, an increase of 50.7% [63].
- In another test, the EMC87 ran a fractal computation
- at two times the speed of the Cyrix 83D87 and 2.6
- times as fast as an Intel 387DX [64]. A third test
- found the EMC87's overall performance to be 20%
- higher than the performance of the Cyrix 83D87
- [65]. The Cyrix FasMath EMC87 has also been sold
- as Cyrix AutoMATH by Cyrix. The two chips are 100%
- identical. Unlike the Cyrix 83D87, which fits into
- the 68-pin 387 coprocessor socket, the EMC87 comes
- in a 121-pin PGA and requires the 121-pin EMC
- (Extended Math Coprocessor) socket. Note that not
- all boards have such a socket, a notable exception
- being IBM's PS/2s, for example. Originally, Cyrix
- claimed support for the fast memory mapped mode of
- the EMC87 from a lot of software vendors (including
- Borland and Microsoft). However, there are only
- very few applications that make use of it, among
- them Evolution Computing's FastCAD 3D, MicroWay
- Inc.'s NDP FORTRAN-386 compiler and Intusoft's
- Spice [63]. I haven't seen the EMC87 being offered
- for about nine months now. It may be that Cyrix
- has discontinued this product due to lack of
- sufficient software support. The EMC87 was available
- in 25 and 33 MHz versions at the end of 1991.
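-
- The reason the memory-mapped mode cannot be used from real
- mode becomes clear when the EMC87 address window is compared
- with the highest address a real-mode segment:offset pair can
- form. The sketch below only does this address arithmetic; the
- constant and function names are made up for this illustration.
-
```python
EMC87_BASE = 0xC0000000      # physical base address given in the text
EMC87_SIZE = 64 * 1024       # 64 kByte block

# Highest physical address reachable in real mode (FFFF:FFFF).
REAL_MODE_LIMIT = (0xFFFF << 4) + 0xFFFF


def emc87_window():
    """Return the first and last physical address of the EMC87 block."""
    return EMC87_BASE, EMC87_BASE + EMC87_SIZE - 1


def reachable_in_real_mode(addr):
    """True if a real-mode segment:offset pair can form this address."""
    return addr <= REAL_MODE_LIMIT


if __name__ == "__main__":
    first, last = emc87_window()
    print(hex(first), hex(last))
    print(hex(REAL_MODE_LIMIT), reachable_in_real_mode(first))
```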
- Cyrix 387+ seems to be the successor to the Cyrix 83D87. On
- ordering a Cyrix coprocessor about a month ago,
- I was automatically supplied with a 387+. In my
- tests, I found the Cyrix 387+ to be about five
- to 10 percent *slower* than the Cyrix 83D87. However,
- some instructions like the square root (FSQRT) now
- only run at half the speed at which they ran on the
- 83D87 (see performance results below). I also found
- the transcendental functions on the 387+ to be a bit
- more accurate than those implemented in the 83D87.
- Why Cyrix has brought out a new coprocessor slower
- than the 83D87 I don't know. I have written to Cyrix
- about this question but haven't received a reply yet.
- Maybe the new coprocessor solves the one small
- hardware compatibility problem the 83D87 had (see
- above paragraph on the 83D87). It could also be that
- Cyrix had to design around the three patents that
- Intel claims the 83D87 violates. I have no idea
- whether the Cyrix 387+ is to replace the 83D87 or
- if both chips will coexist in the market. Like the
- 83D87, the 387+ is available for speeds of up to
- 40 MHz.
- Cyrix 83S87 is the SX version of the Cyrix 83D87. Just like the
- Cyrix 83D87 is the fastest 387 compatible coprocessor,
- the Cyrix 83S87 is the fastest of the 387SX compatible
- coprocessors [1]. Besides being the fastest 387SX
- 'clone', the Cyrix 83S87 also features the most
- accurate transcendental functions. The 83S87 is
- packaged in a 68-pin PLCC and is available in 16,
- 20 and 25 MHz versions. Due to the advanced power
- saving features of the Cyrix coprocessor, the typical
- power consumption of the 20 MHz version is about
- 350 mW [67].
- ULSI 83C87 is a 387 'clone' that came out in early 1991, well
- after the IIT 3C87 and Cyrix 83D87. Like all clones,
- it is somewhat faster than the Intel 387DX. Especially
- the basic arithmetic functions are fast, while the
- transcendental functions show only a slight speed
- improvement over the Intel 387DX (see benchmark
- results below). In my tests, the ULSI had the least
- accurate transcendental functions. However, the
- maximum relative error is still within the limits
- set by Intel, so this is probably not an important
- issue in all but very few applications. The ULSI
- shows some minor flaws in the tests for IEEE-754
- compatibility, but this, too, is unimportant under
- typical operating conditions. ULSI claims that the
- program IEEETEST, which was used to test for IEEE
- compatibility, contains many personal interpretations
- of the IEEE standard by the program's author and
- states that there is no ANSI-certified IEEE-754
- compliance test. While this is most probably true,
- it is also a fact that the IEEE test vectors used in
- IEEETEST are sort of an industry standard and that
- Intel's 387, 486, and RapidCAD chips pass it
- without a single failure. Since the ULSI Math*Co
- 83C87 fails some of the tests, it is certainly less
- than 100% compatible with Intel's chips, although
- this will hardly make any difference in typical
- operating conditions. The ULSI 83C87 is also not fully
- compatible with the Intel 387DX in that it does
- not implement the precision control feature of
- Intel's coprocessor [58]. While all the internal
- operations of 80x87 coprocessors are usually done
- with the maximum precision available (double extended
- precision with 64 mantissa bits), the 80x87 also
- offers the possibility to force lower precision to
- be used for the basic arithmetic functions add,
- subtract, multiply, divide, and square root. This
- feature was included for compatibility with existing
- floating-point implementations at the time the 8087
- was devised. All coprocessors except the ones from
- ULSI support this feature. Since precision control
- is rarely used, this incompatibility with the Intel
- 387DX does not pose major problems. IEEE-754 mentions
- precision control, but requires it only for those
- systems that don't have the possibility to store
- single and double precision results. Therefore, the
- standard does not call for precision control in the
- 387 coprocessor, so the ULSI 83C87's failure to
- provide precision control does not constitute a
- conflict with the IEEE-754 standard for floating
- point arithmetic. Like the other 387 'clones', the
- 83C87 does not support asynchronous operation of the
- CPU and the coprocessor. This means that the 83C87
- always runs at the full speed of the CPU. The ULSI
- 83C87 is available in 20, 25, 33, and 40 MHz versions.
- The ULSI is produced in high performance, low power
- CMOS. Power consumption at 20 MHz is max. 800 mW
- (400 mW typical), at 25 MHz it is max. 1000 mW
- (500 mW typical), at 33 MHz it is max. 1250 mW
- (625 mW typical), and at 40 MHz the ULSI Math*Co 83C87
- consumes max. 1500 mW (750 mW typical) [58]. The
- 83C87 is packaged in a 68-pin ceramic PGA. ULSI
- coprocessors come with a lifetime warranty. ULSI
- Systems, Inc. will replace the coprocessor up to
- three times free of charge should it ever fail.
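-
- For readers unfamiliar with precision control: on the 80x87
- it is selected by a two-bit field (bits 8 and 9) in the
- coprocessor control word. The sketch below only manipulates
- such a control word value as an integer to show the encoding;
- it does not touch any real coprocessor, and the helper names
- are made up for this illustration.
-
```python
PC_SHIFT = 8                     # PC field occupies bits 8-9 of the control word
PC_MASK = 0b11 << PC_SHIFT

PC_SINGLE = 0b00                 # 24 mantissa bits
PC_DOUBLE = 0b10                 # 53 mantissa bits
PC_EXTENDED = 0b11               # 64 mantissa bits (the default)

MANTISSA_BITS = {PC_SINGLE: 24, PC_DOUBLE: 53, PC_EXTENDED: 64}

DEFAULT_CW = 0x037F              # control word of a 387 after FINIT


def get_precision_bits(control_word):
    """Return the mantissa width selected, or None for the reserved code."""
    return MANTISSA_BITS.get((control_word & PC_MASK) >> PC_SHIFT)


def set_precision(control_word, pc):
    """Return a copy of the control word with the PC field replaced."""
    return (control_word & ~PC_MASK) | (pc << PC_SHIFT)


if __name__ == "__main__":
    print(get_precision_bits(DEFAULT_CW))
    cw = set_precision(DEFAULT_CW, PC_DOUBLE)
    print(hex(cw), get_precision_bits(cw))
```
-
- A program would load such a value with FLDCW; on a coprocessor
- without precision control the field has no effect and results
- are always rounded to 64 mantissa bits.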
- ULSI 83S87 is the SX version of the ULSI 83C87 for operation
- with an Intel 386SX or an AMD Am386SX. It is
- functionally equivalent to the 83C87. To aid low
- power laptop designs, the ULSI 83S87 features an
- advanced power saving design with a sleep mode and
- a standby mode with only minimal power requirements.
- Power consumption under normal operating conditions
- (dynamic mode) is max. 400 mW at 16 MHz (300 mW
- typical), max. 450 mW at 20 MHz (350 mW typical),
- and max. 500 mW at 25 MHz (400 mW typical) [58].
- The ULSI 83S87 is packaged in a 68-pin PLCC.
- Intel RapidCAD is not a coprocessor, strictly speaking, although it
- is marketed as one. Rather, it is a CPU replacement.
- It is basically an Intel 486DX without the cache and
- with a 386 pinout. RapidCAD is delivered as a set of
- two chips. RapidCAD-1 goes into the 386 socket and
- contains the CPU and FPU, RapidCAD-2 goes into the
- coprocessor socket and contains a PAL that generates
- the Ferr signal that is normally generated by a
- coprocessor and used by the motherboard circuitry to
- provide 287 compatible coprocessor exception handling
- in 386/387 systems. The RapidCAD instruction set is
- compatible with the 386, so it doesn't know the 486
- specific instructions like BSWAP. Since the RapidCAD
- CPU core is very similar to the 486 CPU core, most of the
- register to register instructions execute in the same
- number of clock cycles as on the 486. The use of the
- 386 bus interface causes instructions that access memory
- to execute at about the same speed as on the 386. The
- integer performance on the RapidCAD is definitely
- limited by the low memory bandwidth provided by the
- 386 bus interface (2 clock cycles per bus cycle)
- and the lack of an internal cache. CPU instructions
- often execute faster than they can be fetched from
- memory, even with a big and fast external cache.
- Therefore, the integer performance of the RapidCAD
- exceeds that of a 386 by at most 25%. This value
- was derived by running some programs that use
- mostly register-to-register operations and few
- memory accesses. This finding is supported by the
- SPEC ratings that Intel reports for the 386-33
- and the RapidCAD-33. While the 386-33 has a
- SPECint of 6.4, the RapidCAD has a SPECint of 7.3
- [28], a 14% increase. Note that these tests used
- the old (1989) SPEC benchmark suite. While CPU
- instructions often execute in one clock cycle on
- the RapidCAD, FPU instructions always take more
- than seven clock cycles. They are therefore rarely
- slowed down by the low memory bandwidth provided
- by the 386 bus interface. My tests show a 70%-100%
- performance increase for floating-point intensive
- benchmarks (see below) over a 386 based system
- using the Intel 387DX math coprocessor. This is
- consistent with the SPECfp rating reported by Intel.
- The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
- the RapidCAD is rated at 6.1 SPECfp at the same
- frequency, an 85% increase. This means that a system
- that uses the RapidCAD is faster than any 386/387
- combination, regardless of the type of 387 used
- (Intel 387DX or faster clone). The diagnostic disk
- for the RapidCAD also gives some application
- performance data for the RapidCAD compared to the
- Intel 387DX:
-
- Application             Time w/ RapidCAD   Time w/ 387DX   Speedup
-
- AUTOCAD 11                   32 sec            52 sec         63%
- AutoShade/Renderman         108 sec           180 sec         67%
- Mathematica (Windows)       103 sec           139 sec         35%
- SPSS/PC+ 4.01                14 sec            17 sec         21%
-
- RapidCAD is available in 25 MHz and 33 MHz versions.
- It is distributed through other channels than the
- other Intel math coprocessors. Therefore, I have been
- unable to obtain a data sheet for it. The RapidCAD-1
- chip gets quite hot when operating and it can be
- assumed that its power consumption is similar to
- the 486-33. Therefore, I recommend extra cooling
- for this chip (see the paragraph below on the 486 for
- details). The RapidCAD-1 is packaged in a 132-pin
- PGA, just like the 80386, and the RapidCAD-2 is
- packaged in a 68-pin PGA like an 80387 coprocessor.
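-
- The speedup percentages in the application table above
- correspond to the ratio of the longer to the shorter run time
- for each program, rounded to a whole percent. The following
- sketch recomputes them from the two times listed per row; the
- function name is made up for this illustration.
-
```python
def speedup_percent(longer_sec, shorter_sec):
    """Speedup of the faster run over the slower run, in percent."""
    return (longer_sec / shorter_sec - 1.0) * 100.0


# (application, shorter time, longer time) in seconds, from the table
RUNS = [
    ("AUTOCAD 11", 32, 52),
    ("AutoShade/Renderman", 108, 180),
    ("Mathematica (Windows)", 103, 139),
    ("SPSS/PC+ 4.01", 14, 17),
]

if __name__ == "__main__":
    for name, shorter, longer in RUNS:
        print(f"{name:22s} {speedup_percent(longer, shorter):5.1f}%")
```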
- Intel 486DX is not a coprocessor. This chip, brought out in
- 1989, functionally combines the CPU (a heavily pipelined
- implementation of the 386 architecture) with an
- enhanced 387 (the floating-point unit, FPU) and
- 8 kB of unified code/data cache on one chip. Of
- course, this description is simplified; for a
- detailed hardware description, see [52]. The
- 486DX offers about two to three times the integer
- performance of a 386 at the same frequency.
- Floating point performance is about three to four
- times as high as on the Intel 387DX at the same
- clock rate [29]. Since the FPU is on the same
- chip as the CPU, the considerable communication
- overhead between CPU and coprocessor in a 386/387
- system is omitted, letting FPU instructions run
- at the full speed permitted by the implementation.
- The FPU also takes advantage of the on-chip cache
- and the highly pipelined execution unit. Besides
- the higher speed, the 486 FPU features more accurate
- transcendental functions than the Intel 387DX
- coprocessor according to tests run by me (see below).
- To achieve better interrupt latency, FPU instructions
- with a long execution time have been made abortable
- in the case an interrupt occurs during their
- execution. The concurrent execution of CPU and
- coprocessor instructions typical for 80x86/80x87
- systems is still in existence on the 486, but
- some FPU instructions like FSIN have nearly no
- concurrency with CPU instructions, indicating
- that they make heavy use of both CPU and FPU
- resources [53, 1]. The 486DX comes in a 168 pin
- ceramic PGA (pin grid array). It is available in
- 25 MHz and 33 MHz versions. Since the end of 1991,
- a 50 MHz version has also been available, made in
- a CHMOS V process (the 25 MHz and 33 MHz versions are
- produced using the CHMOS IV process). Maximum
- power consumption is 3500 mW for the 25 MHz 486
- (2600 mW typical), 4500 mW for the 33 MHz version
- (3500 mW typical), and 5000 mW (4000 mW typical)
- for the 50 MHz chip. Due to the considerable amount
- of heat produced by these chips, and taking into
- consideration the slow air flow provided by the
- fan in garden variety PC tower cases, I recommend
- an extra fan directly above the CPU for safer
- operation. If you measure the surface temperature
- of an i486 in a normal tower case without extra
- cooling after some time of operation, you may well
- come up with something like 80 - 90 degrees Celsius
- (that is 176 - 194 degrees Fahrenheit for those not
- familiar with metric units) [54,55]. You don't need
- the well known and expensive IceCap(tm) to effectively
- cool your CPU. A simple fan mounted directly above
- the CPU can bring the temperature down to about 50
- to 60 degrees Celsius (122 - 140 degrees Fahrenheit)
- depending on the room temperature and the temperature
- within the PC case (which depends on the total power
- dissipation of all the components and the cooling
- provided by the fan in the power unit). According
- to a simple rule known as Arrhenius' Law, lowering
- the temperature by 10 degrees Celsius slows down
- chemical reactions by a factor of two; thus lowering
- the temperature of your CPU by 30 degrees should
- prolong the life of the device by a factor of eight
- due to the slower aging process. If you are reluctant
- to add a fan to your system because of the additional
- noise, settle for a low-noise fan like those
- available from the German manufacturer Pabst (this
- is not meant to be an advertisement. I am just the
- happy owner of such a fan. Besides that, I have no
- connections to the firm).
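-
- The rule of thumb used above (a factor of two per 10 degrees
- Celsius) and the temperature conversions can be checked with
- a few lines. This is only a sketch of the arithmetic, not a
- reliability model for any particular chip; the function names
- are made up for this illustration.
-
```python
def celsius_to_fahrenheit(celsius):
    """Convert a temperature from degrees Celsius to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def lifetime_factor(delta_celsius, doubling_step=10.0):
    """Rule of thumb: each 10 degrees C of cooling doubles device life."""
    return 2.0 ** (delta_celsius / doubling_step)


if __name__ == "__main__":
    for c in (50, 60, 80, 90):
        print(c, "C =", celsius_to_fahrenheit(c), "F")
    print("life factor for 30 degrees of cooling:", lifetime_factor(30))
```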
- Intel 486DX2 is the name for Intel's latest generation of 486 CPUs.
- Using the DX2 suffix instead of simply DX is meant
- to be an indicator that these are clock-doubled
- versions. A normal 486DX operates at the frequency
- provided by the incoming clock signal. A 486DX2
- generates a new clock signal from the incoming clock
- by means of a PLL (phase locked loop). In the DX2,
- this clock signal has twice the frequency of the
- incoming clock, hence the name clock-doubler. All
- internal parts of the 486DX2 (cache, CPU core, FPU)
- run at this higher frequency. Only the bus interface
- runs at the normal speed. That way, a 486DX2-50 can
- run on a motherboard designed for 25 MHz operation.
- Since motherboards for 50 MHz operation are much
- harder to design than those for 25 MHz, this makes
- a 486DX2-50 system easier to build and cheaper than
- a 486DX-50 system. For all operations that don't
- access off-chip resources (e.g. register operations)
- a 486DX2-50 provides exactly the same performance as
- a 486DX-50 and twice the performance of a 486DX-25.
- However, since the main memory in a 486DX2-50 system
- still operates at 25 MHz, all instructions involving
- memory accesses are potentially slower than in a
- 486DX-50 system, whose memory also runs at 50 MHz.
- The internal cache of the 486 mitigates this problem
- somewhat, but overall performance of a 486DX2-50 is still
- lower than that of a 486DX-50, although Intel's
- documentation [32] shows this drop to be quite small.
- It depends a lot on the code one runs, though. The
- nice thing about the 486DX2 is that it allows easy
- upgrading of 25 and 33 MHz 486 systems, since the
- 486DX2 is completely pin-compatible with the 486DX.
- Just take out the 486DX and plug in the new 486DX2.
- Note that power consumption of the 486DX2-50 equals
- that of the 486DX-50 (4000 mW typical), and that the
- 486DX2-66 exceeds this by about 30%. These chips get
- really hot in a standard PC case with no extra cooling.
- See the above paragraph for more detailed information
- on this problem.
- Intel 487SX is the coprocessor intended for use in 486SX systems.
- The 486SX is basically a 486DX without the floating-
- point unit (FPU) [48, 50]. Originally Intel sold
- 486DXs with a defective FPU as 486SXs, but it has
- now completely removed the FPU part from the 486SX
- mask for mass production. The introduction of the
- 486SX in 1991 has been viewed mainly as a marketing
- 'trick' by Intel to take market share from the 386
- based systems once AMD became successful with their
- Am386 (AMD has taken as much as 40% of the 386 market
- due to some superior features such as higher clock
- frequency, lower power consumption, and a fully
- static design). A 486SX at 20 MHz delivers a bit
- less integer performance than a 40 MHz Am386. To add
- floating-point capabilities to a 486SX based system,
- it would be easiest to swap the 486SX with a 486DX
- which includes the FPU. However, Intel has prevented
- this easy solution by giving the 486SX a slightly
- different pin out [48, 51]. Since only three pins
- are assigned differently, clever board manufacturers
- have come out with boards that accept anything from
- a 486SX-20 to a 486DX2-50 in their CPU socket and
- provide a clean upgrade path this way. A set of
- three jumpers ensures correct signal assignment to
- the pins for either configuration. To upgrade systems
- without this feature, one has to buy the 487SX and
- put it into the "Performance Upgrade Socket" present
- in most 486SX systems. Once the 487SX was available,
- it was quickly found out that it is just a normal
- 486DX with a slightly different pin out [49]. Inserting
- the 487SX effectively shuts down the 486SX in the
- 486SX/487SX system, so the 486SX could be removed
- once the 487SX is installed. Since the shut down is
- logical, not electrical, the 486SX still uses power
- if used with the 487SX, although it is inoperative.
- Technically speaking, the solution Intel chose was
- the only practical way to provide a 486SX system with
- the high level of floating-point performance the
- 486DX offers. The CPU and FPU have to be on the same
- chip, otherwise the FPU can not make use of the cache
- on the CPU chip and there would be considerable
- overhead in CPU-FPU communication (similar to a
- 386/387 system), nullifying most of the arithmetic
- speedups over the 387. That the 486SX, 487SX, and
- 486DX are not pin-compatible seems to be purely for
- marketing reasons. To upgrade a 486SX based system,
- Intel also offers the OverDrive chip, which is just
- the same as a 487SX with internal clock doubling. It
- also goes into the "Performance Upgrade Socket"
- found in 486SX systems. The OverDrive roughly doubles
- the performance of a 486SX/487SX based system. For an
- explanation of clock doubling, see the description
- of the 486DX2 above. Like the 486SX, the 487SX is
- available in 20 MHz and 25 MHz versions. At 20 MHz,
- the 487SX has a power consumption of max. 4000 mW.
- It is available in a 169 pin ceramic PGA (pin grid
- array).
- Weitek 3167 was introduced in 1989 to provide the fastest
- floating point performance possible on a 386 based
- system at that time. The Weitek Abacus 3167 is not
- a real coprocessor, strictly speaking, but rather
- a memory mapped peripheral device. The Weitek 3167
- was optimized for speed wherever possible. Besides
- using the faster memory mapped interface to the CPU
- (the 80x87 uses IO-ports), it does not support many
- of the features of the 80x87 coprocessors, allowing
- all of the chip's resources to be concentrated on
- the fast execution of the basic arithmetic operations.
- For a more detailed description of the Weitek 3167 see
- the first chapter of this document. In benchmark
- comparisons, the Weitek 3167 provided up to 2.5 times
- the performance of an Intel 387DX coprocessor. For
- example, on a 33 MHz 3167 the Whetstone benchmark
- performed at 7574 kWhetstones/sec compared with
- 3743 kWhetstones/sec for the Intel 387DX. Note
- however that these are single precision results and
- that the Weitek 3167's performance would drop to
- about half the stated rate for double precision,
- while the value for the Intel 387DX would not change
- much. Anyhow, before the advent of the Intel RapidCAD,
- the Weitek 3167 usually beat all 387 compatible
- coprocessors even for double precision operations
- [63,65,69]. For typical applications the advantage
- of the Weitek 3167 over the 387 clones is much smaller.
- In a benchmark test using AutoDesk's 3D-Studio the
- Weitek 3167 performed at 123% of the Intel 387DX's
- performance compared with 106% for the Cyrix FasMath
- 83D87 and 118% for the Intel RapidCAD. The Weitek
- Abacus 3167 is packaged in a 121-pin PGA that fits
- into an EMC socket provided by most 386 based systems.
- It does *not* fit into the normal coprocessor socket
- designed to hold a 387 compatible coprocessor in a
- 68-pin PGA. To get the best of both worlds, one might
- want to use a Weitek 3167 and a 387 compatible
- coprocessor in the same system. These coprocessors
- can coexist in the same system just fine. The only
- problem is that most 386 based systems contain only one
- coprocessor socket, usually of the EMC (extended math
- coprocessor) type. Thus, you can install either a
- 387 coprocessor or a Weitek 3167, but not both. There
- are little daughter boards available though that fit
- into the EMC socket and provide two sockets, an EMC
- and a standard coprocessor socket. At 25 MHz, the
- Weitek 3167 has a power consumption of max. 1750 mW.
- At 33 MHz, the max. power consumption is 2250 mW.
- Weitek 4167 is a memory mapped coprocessor that has the same
- architecture as the 3167 and is designed to provide
- 486 based systems with the highest floating point
- performance available. It executes coprocessor
- instructions at three to four times the speed of
- the Weitek 3167. Although it is up to 80% faster
- than the Intel 486 in some benchmarks [1,69], the
- performance advantage for real applications is more
- like 10%. The introduction of the 486DX2 processors
- has more or less obliterated the need for a Weitek
- 4167, since the DX2 CPUs provide the same performance
- and all the additional features the 80x87 has over
- the Weitek Abacus. The Weitek 4167 is packaged in
- a 142-pin PGA package that is only slightly smaller
- than the 486's package. At 25 MHz, it has a max.
- power consumption of 2500 mW [32].
-
- Chips & Technologies has shipped samples of their 38700 and
- 38700SX coprocessors, which are compatible with the Intel 387DX
- and Intel 387SX coprocessors, respectively. Both have already
- been tested in [1]. However, C&T's German distributor (Rein
- Elektronik, Nettetal) states that these coprocessors will
- not become generally available before 4Q 1992. The samples
- tested in [1] showed about the same performance as the Cyrix
- 83D87.
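-
- As a worked example of the rule of thumb quoted above in the
- discussion of CPU cooling (each 10 degrees Celsius of cooling
- halves the rate of the aging processes), the following Python
- sketch computes the resulting factor. The function name and its
- parameters are illustrative assumptions, and the rule itself is
- only a rough approximation, not a guarantee of device lifetime.
-
```python
def aging_slowdown(delta_t_celsius, step=10.0, factor_per_step=2.0):
    """Factor by which aging slows when a chip runs delta_t_celsius cooler.

    Assumes the rule of thumb from the text: every `step` degrees of
    cooling slows the relevant reactions by `factor_per_step`.
    """
    return factor_per_step ** (delta_t_celsius / step)


# Cooling a CPU by 30 degrees Celsius, e.g. from about 85 to about 55.
slowdown = aging_slowdown(30.0)
print(f"aging slows down by a factor of {slowdown:g}")
```
-
- For a 30 degree drop this evaluates 2 ** 3, which is the factor
- of eight mentioned in the text.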
-
-
-
- Pricing
-
- Due to recent price cuts by Cyrix and subsequently by Intel
- for 387 coprocessors, prices have dropped significantly for all
- 287 and 387 compatible coprocessors with hardly any price difference
- between manufacturers. 387DX compatible coprocessors typically sell
- for ~US$ 100 for all speeds except for 40 MHz versions which are
- typically ~US$ 130. 387SX compatible coprocessors sell for ~US$ 90
- regardless of speed with the exception of the 33 MHz versions, which
- are ~US$ 100. The Intel 287XL sells for ~US$ 100, while the IIT 2C87
- and Cyrix 82S87 sell for about US$ 70. 8087s may be more expensive,
- the price of an 8087-10 being US$ 150. I bought the Intel RapidCAD
- for US$ 320 and haven't seen it offered for a better price. I see
- the Weitek Abacus 3167-33 being offered for US$ 780 and the 4167-33
- being offered for US$ 1100. This price information reflects the
- price situation as of 08-14-92. Prices can be expected to drop
- slightly in the near future.
-
- If you need high floating-point performance, you should consider
- buying a 486 based system rather than a 386 based system with an
- additional coprocessor. A 386 motherboard for 33 MHz operation
- sells for ~US$ 300; together with the coprocessor, the total comes
- to ~US$ 400. A 486-33 ISA board sells for US$ 650. While the 486-33
- system is about 60% more expensive than the 386/387 system, it also
- provides 100% more integer and floating-point performance (twice
- the performance). If you want to push
- your 386 based system to maximum floating-point performance and
- can't switch to a 486 based system for some reason, I recommend
- the Intel RapidCAD. It is both faster [1] and cheaper than installing
- a Weitek Abacus 3167 with your 386, which used to be the highest
- performing combination before the RapidCAD came out. Similarly,
- the introduction of the 486DX2 clock-doubler chips has obliterated
- the need for a Weitek 4167 to get maximum floating-point performance
- out of a 486 based system. A 486DX2-66 performs at or above the
- performance level of a 33 MHz Weitek 4167, even if the latter
- uses single precision rather than double precision. The 486DX2-66
- is rated by Intel at 24700 double precision kWhetstones/sec and
- 3.1 double precision Linpack MFLOPS. Of course, these benchmarks
- used the highest performance compilers available. But even with
- a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double precision
- MFLOPS out of the 486DX2-66 for the LLL benchmark (for a description
- of the benchmarks mentioned, see the paragraph on benchmarks below).
- Although I haven't yet seen 486DX2-66 processors offered to
- end users for upgrade purposes, I recommend the 486DX2-66
- to those that need highest floating-point performance and are
- planning on buying a new PC. The price difference between a
- 33 MHz 486DX motherboard and a 486DX2-66 motherboard is around
- US$ 600, well below the price for the Weitek Abacus 4167.
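-
- The price/performance argument made above for the 486-33 versus
- the 386/387 combination can be checked with a few lines of Python.
- This is only a sketch: the prices are the approximate figures
- quoted in the text, the factor of two in performance is the
- text's own estimate, and all variable names are made up for
- illustration.
-
```python
def percent_more(new, old):
    """Return by how many percent `new` exceeds `old`."""
    return (new / old - 1.0) * 100.0


# Approximate prices in US$ as quoted in the text (August 1992).
cost_386_387 = 300 + 100   # 33 MHz 386 motherboard plus 387 coprocessor
cost_486 = 650             # 486-33 ISA motherboard

# Relative performance as estimated in the text: the 486-33 is taken
# to be twice as fast as the 386/387 combination.
perf_386_387 = 1.0
perf_486 = 2.0

extra_cost = percent_more(cost_486, cost_386_387)
extra_perf = percent_more(perf_486, perf_386_387)

# Cost per unit of performance (lower is better).
cost_per_perf_386_387 = cost_386_387 / perf_386_387
cost_per_perf_486 = cost_486 / perf_486

print(f"extra cost: {extra_cost:.1f}%, extra performance: {extra_perf:.0f}%")
print(f"US$ per unit of performance: "
      f"{cost_per_perf_386_387:.0f} (386/387) vs {cost_per_perf_486:.0f} (486)")
```
-
- With these figures the 486 system costs roughly 60% more but
- comes out cheaper per unit of performance.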
-
-
-
- Operation
-
- In an 80x86/80x87 system, CPU instructions and coprocessor
- instructions are executed concurrently. This means that
- the CPU can execute CPU instructions while the coprocessor
- executes a coprocessor instruction at the same time. The
- concurrency is restricted somewhat by the fact that the
- CPU has to aid the coprocessor in certain operations. As
- the CPU and the coprocessor are fed from the same instruction
- stream and both processors may operate on the same
- data, there has to be a synchronizing mechanism between the
- CPU and the coprocessor.
-
- In an 8086/8087 or 8088/8087 system, both chips look at the
- opcodes coming in from the bus. To do this, both chips have
- the same BIU (bus interface unit) and the 8086 BIU sends the
- status signals of its prefetch queue to the 8087 BIU. This
- ensures that both processors always decode the same instructions
- in parallel. Since all coprocessor instructions start with the
- bit pattern 11011, it is easy for the 8087 to ignore all other
- instructions. Likewise the CPU ignores all coprocessor instructions
- except if they access memory. In this case, the CPU computes
- the address of the LSB (least significant byte) of the memory
- operand and does a dummy read. The 8087 then takes the data
- from the data bus. If more than one memory access is needed to
- load a memory operand, the 8087 requests the bus from the CPU,
- generates the consecutive addresses of the operand's bytes
- and fetches them from the data bus. After completing the operation,
- the 8087 hands bus control back to the CPU. Since 8087 and CPU
- are hooked up to the same synchronous bus, they have to run at
- the same speed. This means that with the 8087, only synchronous
- operation of CPU and coprocessor is possible. Another 8087
- coprocessor instruction can only be started if the previous one
- has been completed in the NEU (numerical execution unit) of the
- 8087. To prevent the 8086 from decoding a new coprocessor
- instruction while the 8087 is still executing the previous
- coprocessor instruction, the following mechanism is used: The
- compilers and assemblers automatically generate a WAIT instruction
- before each coprocessor instruction. The WAIT instruction tests
- the /TEST pin until its input becomes "LOW". In 8086/8087 systems,
- the 8086 /TEST pin is connected to the 8087 BUSY pin. As long
- as the NEU executes a coprocessor instruction, it forces its
- BUSY pin "HIGH". Thus the WAIT instruction in front of every
- coprocessor instruction stops the CPU until a still executing
- previous coprocessor instruction has finished. The same
- synchronization is used before the CPU accesses data that
- was written by the coprocessor. A WAIT instruction after the
- coprocessor instruction that writes to memory causes the CPU to
- stop until the coprocessor has transferred the data to memory,
- after which the CPU can safely access the data.
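-
- The two mechanisms described above (recognizing coprocessor
- opcodes by their leading bit pattern 11011, and placing a WAIT
- in front of each of them) can be sketched in Python as follows.
- This is a simplified illustration, not the way any particular
- assembler works: the function names are invented, each
- instruction is assumed to be given as a separate byte string,
- and prefix bytes in front of the opcode are not handled.
-
```python
ESC_MASK = 0b11111000      # look only at the five most significant bits
ESC_PATTERN = 0b11011000   # 11011xxx, i.e. first opcode bytes D8h..DFh
WAIT_OPCODE = 0x9B         # opcode of the 8086 WAIT instruction


def is_coprocessor_opcode(first_byte):
    """True if the first opcode byte starts with the bit pattern 11011."""
    return (first_byte & ESC_MASK) == ESC_PATTERN


def add_waits(instructions):
    """Put a WAIT in front of every coprocessor instruction.

    `instructions` is a list of bytes objects, one per instruction.
    """
    result = []
    for instruction in instructions:
        if is_coprocessor_opcode(instruction[0]):
            result.append(bytes([WAIT_OPCODE]))
        result.append(instruction)
    return result


# FLD1 (D9 E8), NOP (90), FADDP ST(1),ST (DE C1)
program = [b"\xD9\xE8", b"\x90", b"\xDE\xC1"]
for instruction in add_waits(program):
    print(instruction.hex())
```
-
- Only the two coprocessor instructions receive a leading 9B byte;
- the NOP is passed through unchanged.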
-
- With the help of an additional chip, the 8087 can also be inter-
- faced to the 80186 [36]. The 80186 was the CPU in some PCs (e.g.
- from Philips, Siemens) in the 1982/1983 time frame, but with
- the introduction of the IBM AT which used the 80286, it lost all
- significance for the PC market. The 80C186 (CMOS version of the
- 80186) nowadays sells as an embedded controller and can be combined
- with an 80C187 coprocessor which is based on the internals of the
- Intel 387 [37].
-
- The 80287 CPU-interface is totally different from the solution
- used in the 8087. Since the 80286 implements memory protection
- via an MMU based on segmentation, it would have been much too
- expensive to duplicate the whole protection logic on the coprocessor
- for an interface solution similar to the 8087. In an 80286/80287
- system, the CPU fetches and stores all opcodes and operands for
- the coprocessor. Information is passed through ports F8h - FFh.
- As these ports are accessible under program control, care must
- be taken not to write to them accidentally, as
- this could corrupt the information in the math coprocessor.
- The execution unit of the 80287 is practically identical to that
- of the 8087, that is, nearly all coprocessor instructions execute
- in the same number of clock cycles on both coprocessors. Due to
- the additional overhead of the CPU/coprocessor interface (at
- least ~40 clock cycles), an 8 MHz 80286/80287 combination can be
- slower than an 8086/8087 system running at the same speed for
- floating point intensive programs. Additionally, most of the
- older 286 boards were configured to run the coprocessor at 2/3
- the speed of the CPU, making use of the ability of the 80287
- to run asynchronously with the CPU. The 80287 has a CKM pin that
- causes the incoming system clock to be divided by three for
- the coprocessor if it is tied to ground. The 80286 always
- divides the system clock by two internally. Thus the ratio 2/3.
- However, when the CKM (ClocK Mode) pin is tied high on the 80287,
- it does not divide the CLK input. This feature has been exploited
- by the makers of coprocessor speed sockets. These sockets tie
- CKM high and supply their own CLK signal with a built-in oscillator,
- thereby allowing the 80287 or compatible to run at a much higher
- speed than the CPU. With an IIT or Cyrix 287 one can have a
- 20 MHz coprocessor running with an 8 MHz 80286. Note however that
- the floating-point performance in such a configuration does not
- scale linearly with the coprocessor clock, since all the data
- has to be passed through the much slower CPU. If the coprocessor
- executes mostly simple instructions such as addition and multiplication,
- doubling the coprocessor clock in a 10 MHz system to 20 MHz does
- not show any performance increase at all [24]. The 80C287 by AMD
- is a 100% clone of the original Intel 80287, but is produced in
- CMOS rather than in NMOS like the original Intel chip. This makes for lower
- power consumption.
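-
- The clock relationships described above can be written down as a
- small Python sketch. The function names and the 16 MHz example
- clock are assumptions chosen for illustration; the divide-by-two
- in the 80286 and the divide-by-three in the original 80287 with
- CKM tied to ground are taken from the text.
-
```python
def cpu_clock_286(system_clk_mhz):
    """The 80286 always divides the incoming system clock by two."""
    return system_clk_mhz / 2.0


def coprocessor_clock(clk_mhz, ckm_high, divisor=3.0):
    """Internal clock of an 80287 for a given CLK input.

    With CKM tied high the CLK input is used undivided; with CKM tied
    to ground it is divided by `divisor` (three on the original 80287).
    """
    return clk_mhz if ckm_high else clk_mhz / divisor


system_clk = 16.0                                     # example system clock in MHz
cpu = cpu_clock_286(system_clk)                       # CPU runs at half of it
fpu = coprocessor_clock(system_clk, ckm_high=False)   # 80287 runs at a third
ratio = fpu / cpu                                     # coprocessor speed relative to CPU

# A speed socket ties CKM high and feeds its own oscillator, here 20 MHz.
socket_fpu = coprocessor_clock(20.0, ckm_high=True)

print(f"CPU {cpu:.2f} MHz, 80287 {fpu:.2f} MHz, ratio {ratio:.3f}")
print(f"80287 in speed socket: {socket_fpu:.2f} MHz")
```
-
- With CKM grounded the coprocessor ends up at 2/3 of the CPU
- speed, whatever the system clock is.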
-
- The 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals
- of a 387 coprocessor, but are pin-compatible with the original 287.
- However, these chips divide the system clock by two internally,
- as opposed to three in the original Intel 80287. Since the 80286
- also divides the system clock by two, they usually run synchronously
- with the CPU. They can also run asynchronously, though.
-
- The 8086/8087 combination can be characterized as a cooperation of
- partners with equal rights, while the 80286/287 is more a master-
- slave relationship. This makes synchronization much easier, since
- the complete instruction and data flow of the coprocessor goes through
- the CPU. Before executing most coprocessor instructions, the 80286
- tests its /BUSY pin which is hooked up to the 287 coprocessor and
- signals if the 80287 is still executing a previous coprocessor
- instruction or has encountered an exception. The 80286 then waits
- until the 80287 is not busy before loading the coprocessor instruction
- into the coprocessor. Therefore, a WAIT instruction before every
- coprocessor instruction is not required. These WAITs are permissible,
- but not necessary in 80287 programs. The second form of WAIT
- synchronization after the coprocessor has written a memory operand is
- still necessary on 286/287 systems.
-
- The coprocessor interface in 80386/80387 systems is very similar to
- the one found in 286/287 systems. However, to prevent corruption
- of the coprocessor's contents by programming errors, the IO-ports
- 800000F8 - 800000FF are used which are not user accessible. The
- interface has been optimized and uses 32-bit transfers. The overhead
- of the interface has been reduced to about 16-20 clock cycles. For
- some operations on the 387 'clones' that take less than 16 clock
- cycles to complete, this effectively limits the execution rate of
- coprocessor instructions. The only sensible solution to provide
- even higher floating point performance was to integrate the CPU
- and coprocessor functionality onto the same chip. This is what
- Intel did with the 80486. The FPU in the 486 also benefits from
- the instruction pipelining and from the integrated cache.
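-
- The effect of the interface overhead on short instructions can be
- illustrated with a deliberately simple model in Python. The model
- is an assumption made for this sketch and is not taken from any
- data sheet: it supposes that a new coprocessor instruction cannot
- be issued more often than once per interface overhead period, so
- the time per instruction is the larger of the execution time and
- the overhead.
-
```python
def peak_instruction_rate(clock_mhz, exec_cycles, overhead_cycles=16):
    """Upper bound on coprocessor instructions per second, in millions.

    Simplified model: each instruction costs at least `overhead_cycles`
    clock cycles, even if the coprocessor needs fewer cycles to execute it.
    """
    return clock_mhz / max(exec_cycles, overhead_cycles)


# A hypothetical 8-cycle operation versus a 32-cycle operation at 40 MHz.
fast_op = peak_instruction_rate(40.0, exec_cycles=8)
slow_op = peak_instruction_rate(40.0, exec_cycles=32)
print(f"8-cycle operation:  at most {fast_op:.2f} million instructions/sec")
print(f"32-cycle operation: at most {slow_op:.2f} million instructions/sec")
```
-
- In this model, making an operation faster than the interface
- overhead no longer increases the achievable instruction rate.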
-
-
-
- Performance
-
- Several computer magazines have published performance comparisons
- at the application level for the 387 coprocessors and Weitek's
- ABACUS 3167 and 4167 chips [1,25,68,70]. Applications tested included
- AutoCAD R11, RenderStar, Quattro Pro, Lotus 1-2-3, and AutoDesk's
- 3D-Studio. For most tests, performance improvements for the 387
- clones over Intel's 387DX were small to marginal, the clones running
- the applications no more than 5% to 15% faster than the Intel 387DX.
- In the test of 3D-Studio, one of the few programs that supports
- the Weitek Abacus, the Weitek 3167 improved performance by 23%
- over an Intel 387DX and the 4167 improved performance by 10% over
- the 486 [1].
-
-
- The Intel Math Coprocessor Utilities Disk that accompanies the
- Intel 387DX coprocessor has a demonstration program that shows
- the speedup of certain application programs when run with the
- Intel coprocessor vs. a system with no coprocessor.
-
- Application         Time w/o 387    Time w/ 387    Speedup
-
- Art&Letters             87.0 sec       34.8 sec       150%
- Quattro Pro              8.0 sec        4.0 sec       100%
- Wingz                   17.9 sec        9.1 sec        97%
- Mathematica            420.2 sec      337.0 sec        25%
-
-
- The following table is an excerpt from [70]:
-
- Application         Time w/o 387    Time w/ 387    Speedup
-
- Corel Draw             471.0 sec      416.0 sec        13%
- Freedom Of Press       163.0 sec       77.0 sec       112%
- Lotus 1-2-3            257.0 sec       43.0 sec       498%
-
-
- The following table is an excerpt from [25]:
-
- Application         Time w/o 387    Time w/ 387    Speedup
-
- Design CAD, Test1       98.1 sec       50.0 sec        96%
- Design CAD, Test2       75.3 sec       35.0 sec       115%
- Excel, Test 1            9.2 sec        6.8 sec        35%
- Excel, Test 2           12.6 sec        9.3 sec        35%
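-
- The speedup percentages in the tables above are consistent with
- the following definition: the ratio of the two run times minus
- one, expressed in percent. The short Python sketch below applies
- it to two rows of the first table; the function name is chosen
- for illustration only.
-
```python
def speedup_percent(time_without, time_with):
    """Speedup in percent gained by adding the coprocessor."""
    return (time_without / time_with - 1.0) * 100.0


# (time without 387, time with 387) in seconds, from the first table
examples = {
    "Art&Letters": (87.0, 34.8),
    "Quattro Pro": (8.0, 4.0),
}
for name, (t_without, t_with) in examples.items():
    print(f"{name}: {speedup_percent(t_without, t_with):.0f}%")
```
-
- A speedup of 100% therefore means that the run time was halved.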
-
-
-
- The performance statistics below were put together with the
- help of four widely known numeric benchmarks and two benchmarks
- developed by me. Three Pascal programs, one FORTRAN program,
- and two assembly language programs were used. The assembly language
- programs were linked with Turbo-Pascal 6.0 for library support,
- especially to include the coprocessor emulator of the TP 6.0
- run-time library. The Pascal programs were compiled with Turbo
- Pascal 6.0 from Borland International, a non-optimizing compiler
- that produces 16-bit code. The FORTRAN program was compiled using
- MS FORTRAN 5.0, an optimizing compiler that generates 16-bit
- code. All programs except PEAKFLOP and SAVAGE, which use double
- extended precision, use double precision variables. Note that
- with a highly optimizing compiler producing 32-bit code, you
- will see much higher performance for some benchmarks. For example,
- Intel rates the 33 MHz 386/387DX at 3290 KWhetstones/sec and 0.4
- double precision LINPACK MFLOPS [28,29]. The 33 MHz Intel 486 is
- rated by Intel at 12300 KWhetstones/sec and 1.6 double precision
- LINPACK MFLOPS [30]. The compilers used in these benchmarks run by
- the chip vendor are the ones that give the highest performance
- available. These compilers are in the US$ 1000+ price range.
- Some of them may be experimental or prereleased versions not
- available to the general public. The relative performance of
- one coprocessor to another could vary depending on the code
- generated by compilers. Non-optimizing compilers tend to generate
- a high percentage of operations which access variables in memory,
- while optimizing compilers produce code that contains many
- operations involving registers. Thus it is quite possible that
- coprocessor A beats coprocessor B running benchmark Z if compiled
- with compiler C, but B beats A when the same benchmark is compiled
- using compiler D. All benchmarks in this overview were run from
- floppy under a 'bare-bones' MS-DOS 5.0 without the CONFIG.SYS
- and AUTOEXEC.BAT files. This way, it was made sure no TSR or
- other program unnecessarily stole computing resources from the
- benchmarks.
-
- Coprocessor performance also depends on the motherboard, or more
- specifically the chip set used on the motherboard. In [34] and [35]
- identically configured motherboards using different 386 chip sets
- were tested. Among other tests, a coprocessor benchmark based
- on a fractal computation was run and its execution time
- recorded. The following tables, which show how coprocessor
- performance varies with the chip set, have been copied from
- these articles in abridged form.
-
-                   Cyrix                                     Cyrix
- chip set          387+                  chip set            83D87
-
- Opti,  40 MHz     24.57 sec   97.0%     PC-Chips, 33 MHz    26.97 sec   93.0%
- Elite, 40 MHz     24.46 sec   97.4%     UMC,      33 MHz    27.69 sec   90.5%
- ACT,   40 MHz     23.84 sec  100.0%     Headland, 33 MHz    25.08 sec  100.0%
- Forex, 40 MHz     23.84 sec  100.0%     Eteq,     33 MHz    27.38 sec   91.6%
-
- This shows that performance of the same coprocessor can vary by
- up to ~10% depending on the chip set used on your board, at least
- for 386 motherboards (similar numbers for 286, 386sx, and 486 are
- unfortunately not available). The benchmarks for this article were
- run on a board with the Forex chip set, which is one of the fastest
- 386 chip sets there is, not only with respect to floating-point
- performance [35].
-
-
- Description of benchmarks
-
- PEAKFLOP is the kernel of a fractal computation. It consists
- mainly of a tight loop written in assembly code and fine tuned
- to give maximum performance. All variables are held in the
- CPU's and coprocessor's registers, so the only memory access
- is for opcode fetches. The main loop contains three multiplications
- and five additions/subtractions. This ratio is fairly typical
- for other floating point intensive programs as well. The whole
- program fits nicely into even a very small CPU cache. Due to
- the nature of this program, its MFLOPS rate is unlikely to be
- exceeded by any program that calculates anything useful. Hence
- the name PEAKFLOP. You will find the source code for PEAKFLOP
- in appendix B.
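-
- To give an idea of what a fractal kernel with three
- multiplications and five additions/subtractions per pass can look
- like, here is a Python sketch of the well-known Mandelbrot
- iteration. That PEAKFLOP computes exactly this fractal in exactly
- this way is an assumption; the real program is the assembly
- source in appendix B. The function names and the count of eight
- floating-point operations per pass used for the MFLOPS figure
- belong to this sketch only.
-
```python
def mandelbrot_iterations(cx, cy, max_iter=100):
    """Iterate z := z*z + c and return the number of passes completed."""
    x = y = 0.0
    for n in range(max_iter):
        xx = x * x              # multiplication 1
        yy = y * y              # multiplication 2
        if xx + yy > 4.0:       # addition 1, followed by a comparison
            return n
        xy = x * y              # multiplication 3
        y = xy + xy + cy        # additions 2 and 3 (2*x*y done as xy + xy)
        x = xx - yy + cx        # subtraction 4 and addition 5
    return max_iter


def mflops(iterations, seconds, flops_per_iteration=8):
    """MFLOPS rate for a given number of loop passes and elapsed time."""
    return flops_per_iteration * iterations / seconds / 1e6


print(mandelbrot_iterations(0.0, 0.0, max_iter=50))   # point inside the set
print(mandelbrot_iterations(2.0, 2.0, max_iter=50))   # point that escapes quickly
```
-
- All intermediate values of such a loop fit into the eight
- registers of the coprocessor stack, which is why a kernel of this
- kind needs no memory accesses apart from opcode fetches.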
-
- TRNSFORM multiplies an array of 8191 vectors by a 3D-transformation
- matrix (a 4x4 matrix). Each vector consists of four double precision
- values. Multiplying vectors by a matrix is a typical operation in
- the manipulation (e.g. rotation) of 3D objects which are made up of
- many vectors describing the object. This benchmark stresses addition
- and multiplication as well as memory access. For each vector, 16
- multiplications and 12 additions are used. About 256 kByte of data
- is accessed during the benchmark. TRNSFORM is implemented as an
- optimized assembler program linked with the Turbo Pascal 6.0 library.
- For the IIT 3C87, a special version was written that makes use of
- the special F4X4 instruction available on that coprocessor. F4X4
- does a full multiplication of a 4x4 matrix by a 4x1 vector in a
- single instruction. The full source code for the TRNSFORM program is
- in appendix B.
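-
- The operation counts given above follow directly from the
- structure of a 4x4 matrix by 4x1 vector product, as the following
- Python sketch shows. It is not the TRNSFORM program itself (that
- is the assembly source in appendix B); the function name and the
- example matrix are chosen for illustration.
-
```python
def transform(matrix, vector):
    """Multiply a 4x4 matrix by a 4-element column vector.

    Each of the four result elements needs 4 multiplications and
    3 additions, giving 16 multiplications and 12 additions in total.
    """
    result = []
    for row in matrix:
        acc = row[0] * vector[0]
        for k in range(1, 4):
            acc += row[k] * vector[k]
        result.append(acc)
    return result


# Example: a translation by (5, 6, 7) in homogeneous coordinates.
translate = [
    [1.0, 0.0, 0.0, 5.0],
    [0.0, 1.0, 0.0, 6.0],
    [0.0, 0.0, 1.0, 7.0],
    [0.0, 0.0, 0.0, 1.0],
]
print(transform(translate, [1.0, 2.0, 3.0, 1.0]))

# Size of the vector array: 8191 vectors of four 8-byte doubles.
data_bytes = 8191 * 4 * 8
print(data_bytes, "bytes, or about", round(data_bytes / 1024), "kByte")
```
-
- The array size computed at the end agrees with the figure of
- about 256 kByte quoted above.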
-
- LLL is short for Lawrence Livermore Loops [21], a set of kernels
- taken from real floating point intensive programs. Some of these
- loops are vectorizable, but since we don't deal with vector
- processors here, this doesn't matter. For this test, LLL was
- adapted from the FORTRAN original [20] to Turbo Pascal 6.0. By
- variable overlaying (similar to FORTRAN's EQUIVALENCE statement)
- memory allocation for data was reduced to 64 kB, so all data fits
- into a single 64 kB segment. The older version of LLL is used here
- which contains 14 loops. There also exists a newer, more elaborate
- version consisting of 24 kernels. The kernels in LLL exercise only
- multiplication and addition. The MFLOPS rate reported is the
- average of the MFLOPS rate of all 14 kernels as reported by the
- LLL program. LLL and Whetstone results (see below) are reported
- as returned by my COMPTEST test program in which they have been
- included as a measure of coprocessor/FPU performance. COMPTEST
- has been compiled under Turbo Pascal 6.0 with all 'optimizations'
- on and using my own run-time library, which gives higher
- performance than the one included with TP 6.0. My library is
- available as TPL60N15.ZIP from garbo.uwasa.fi and ftp-sites that
- mirror this site.
-
- Linpack [5] is a well known floating-point benchmark that also
- heavily exercises the memory system. Linpack operates on large
- matrices and takes up about 570 kB in the version used for this
- test. This is about the largest program size a pure DOS system
- can accommodate. Linpack was originally designed to estimate
- performance of BLAS, a library of FORTRAN subroutines that
- handles various vector and matrix operations. It uses two routines
- from BLAS which are thought to be typical of the matrix operations
- used by BLAS. Both routines only use addition/subtraction and
- multiplication. The FORTRAN source code for Linpack can be
- obtained from the automated mail server netlib@ornl.gov. Linpack
- was compiled using MS Fortran 5.0 in the HUGE memory model (which
- can handle data structures larger than 64 kB) and with compiler
- switches set for maximum optimization. Linpack repeatedly does
- the same test. The number reported is the maximum MFLOPS rate
- returned by Linpack. Linpack MFLOPS ratings for a great number
- of machines are contained in [6]. This PostScript document is
- also available from netlib@ornl.gov.
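-
- The kind of vector operation that dominates the run time of
- Linpack is the update y := a*x + y, known in BLAS as DAXPY. The
- text above does not name the two BLAS routines it refers to, so
- taking DAXPY as the example is an assumption. The Python sketch
- below shows the operation and why it consists only of
- multiplications and additions; the helper for counting
- floating-point operations is made up for illustration.
-
```python
def daxpy(a, x, y):
    """Return a*x + y for a scalar a and two vectors of equal length."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def daxpy_flops(n):
    """One multiplication and one addition per vector element."""
    return 2 * n


x = [1.0, 2.0, 3.0]
y = [1.0, 1.0, 1.0]
print(daxpy(2.0, x, y))
print(daxpy_flops(len(x)), "floating-point operations")
```
-
- Since every element is loaded, updated and stored again, a loop
- of this kind exercises the memory system as heavily as the
- coprocessor.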
-
- Whetstone [2,3,4] is a synthetic benchmark based upon statistics
- collected about the use of certain control and data structures
- in programs written in high level languages. Based on these
- statistics, Whetstone tries to mirror a 'typical' HLL program.
- Whetstone performance is expressed by how many theoretical
- 'whetstone' instructions are executed per second. It was
- originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and
- Linpack, Whetstone not only uses addition and multiplication
- but exercises all basic arithmetic operations as well as some
- transcendental functions. Whetstone performance depends on the
- speed of the coprocessor as well as on the speed of the CPU,
- while PEAKFLOP, LLL, and Linpack place a heavier burden on the
- coprocessor/FPU. There exist an old and a new version of
- Whetstone. Note that results from the two versions can differ
- by as much as 20% for the same test configuration. For this
- test, the new version in Pascal from [3] was used. It was
- compiled with Turbo Pascal 6.0 and my own library (see above)
- with all 'optimizations' on.
-
- SAVAGE tests the performance of transcendental function
- evaluation. It is basically a small loop in which the sin,
- cos, arctan, ln, exp, and sqrt functions are combined in a
- single expression. While sin, cos, arctan, and sqrt can be
- evaluated directly with a single 387 coprocessor instruction
- each, ln and exp need additional preprocessing for argument
- reduction and result conversion. According to [14], the Savage
- benchmark was devised by Bill Savage, and is distributed by:
- The Wohl Engine Company, Ltd., 8200 Shore Front Parkway,
- Rockaway Beach, NY 11693, USA. Usually, Savage is programmed
- to make 250,000 passes through the loop. Here only 10,000 passes
- are executed for a total of 60,000 transcendental function
- evaluations. The result is expressed in function evaluations
- per second. SAVAGE source code was taken from [7] and compiled
- with Turbo Pascal 6.0 and my own run-time library (see above).
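-
- The structure of the Savage loop can be sketched in Python as
- follows. The expression is the commonly published form of the
- benchmark, with the tangent written as sin/cos so that the six
- functions named above are each evaluated once per pass; the exact
- source used for the measurements is the one from [7] and may
- differ in detail, and the function name is invented for this
- example.
-
```python
import math


def savage(passes=10_000):
    """Run the Savage loop and return the final value of a."""
    a = 1.0
    for _ in range(passes):
        # sqrt, ln, exp and arctan undo each other in pairs ...
        t = math.atan(math.exp(math.log(math.sqrt(a * a))))
        # ... and sin/cos (the tangent) undoes the arctan, so each
        # pass ideally just adds 1 to a.
        a = math.sin(t) / math.cos(t) + 1.0
    return a


passes = 10_000
evaluations = 6 * passes      # six function evaluations per pass
result = savage(passes)
print(f"{evaluations} function evaluations, result {result!r}")
print(f"deviation from the exact value: {abs(result - (passes + 1)):.3e}")
```
-
- With exact arithmetic the result would be the number of passes
- plus one, so the deviation printed at the end gives a rough
- impression of the accumulated rounding error of the
- transcendental functions.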
-
-
- Benchmark results for 387 coprocessors, coprocessor emulators and
- the Intel RapidCAD and Intel 486 CPUs.
-
-
- 40 MHz          PEAKFLOP  TRNSFORM     LLL  Linpack  Whetstone    Savage
-                   MFLOPS    MFLOPS  MFLOPS   MFLOPS  kWhet/sec  Func/sec
-
- 386, EM87         0.0084    0.0080  0.0060   0.0060         31       502  ##
- 386, Franke387    0.0369    0.0295  0.0233   0.0215        164      4002  $$
- 386, TP 6 Emu     0.0316    0.0273  0.0200   0.0190        160      3794  %%
- Intel 387DX       0.9204    0.7212  0.3932   0.3211       2428     52677
- ULSI 83C87        1.2093    0.7936  0.3890   0.3120       2528     56926
- IIT 3C87          1.0196    0.7145  0.3834   0.3179       2663     58766
- IIT 3C87,4x4      1.0196    1.7244  0.3834   0.3179       2663     58766  ??
- Cyrix 387+        1.1305    0.8162  0.3945   0.3208       2946     80322
- Intel RapidCAD    2.2128    1.8931  0.7377   0.5432       4810     86957
- Intel 486         2.4762    2.1335  1.1110   0.8204       6195     98522
-
-
- 33.3 MHz        PEAKFLOP  TRNSFORM     LLL  Linpack  Whetstone    Savage
-                   MFLOPS    MFLOPS  MFLOPS   MFLOPS  kWhet/sec  Func/sec
-
- 386, EM87         0.0070    0.0040  0.0050   0.0050         26       418  ##
- 386, Franke387    0.0307    0.0246  0.0194   0.0179        137      3335  $$
- 386, TP 6 Emu     0.0263    0.0227  0.0167   0.0158        133      3160  %%
- Intel 387DX       0.7647    0.6004  0.3283   0.2676       2046     43860
- ULSI 83C87        1.0097    0.6609  0.3239   0.2598       2089     47431
- IIT 3C87          0.8455    0.5957  0.3198   0.2646       2203     49020
- IIT 3C87,4x4      0.8455    1.4334  0.3198   0.2646       2203     49020  ??
- Cyrix 387+        0.9286    0.6806  0.3293   0.2669       2435     66890
- Cyrix 83D87        1.013       N/A   0.333    0.273       2550       N/A
- Intel RapidCAD    1.8572    1.5798  0.6072   0.4533       3953     72464
- Intel 486         2.0800    1.7779  0.9387   0.6682       5143     82192
-
- For comparison:
-
-                   PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
-                     MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec
-
- i486DX2-66          4.1601    3.4227    1.6531    1.3010     10655    163934
- i486DX2-50          3.0589    2.6665    1.2537    0.9744      7962    123203
- i387, 20 MHz        0.2253    0.3271    0.1434    0.1171       952     21739  ++
- i387DX, 20 MHz      0.3567    0.4444    0.1484    0.1161      1034     24155  &&
- i80287, 5 MHz       0.0281    0.0310    0.0242    0.0222       150      3261  !!
- i8087,9.54 MHz      0.0636    0.0705    0.0321    0.0219       234      5782  **
-
- HW configuration for test of 387 coprocessors and Intel RapidCAD:
- System A: Motherboard with Forex chip set, 128 kB CPU Cache, 8 MB RAM
-
- HW configuration for test of 486 FPU (extra fan for 40 MHz operation):
- System B: Motherboard with SIS chip set, 256 kB CPU Cache, 8 MB RAM
-
- ## EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator
- that loads as a TSR. It uses INT 7 traps emitted by 80286, 80386
- systems with no coprocessor upon encountering coprocessor
- instructions to catch coprocessor instructions and emulate them.
- Whetstone and Savage benchmarks for this test were compiled
- with the original TP 6.0 library, as EM87 chokes on the 387
- specific FSIN and FCOS instructions used in my own library if
- a 387 is detected. Obviously EM87 identifies itself as a 387,
- but has no support for 387 specific instructions.
- $$ Franke387 is a commercial 387 emulator that is also available in
- a shareware version. For this test, shareware version V2.4 was
- used. Franke387 unlike many other emulators supports all 387
- instructions. It is loaded as a device driver and uses INT 7
- to trap coprocessor instructions.
- %% These benchmarks were run using the built-in coprocessor emulators
- of the TP 6.0 and the MS FORTRAN 5.0 run-time libraries.
- ?? The 3C87 specific F4X4 instruction was used in the vector trans-
- formation benchmark.
- ++ Older motherboard with no chip set (discrete logic), no CPU cache,
- 16 MB RAM
- && System A, CPU cache disabled via extended set-up, turbo-switch
- set to half speed (that is, 20 MHz)
- !! 80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM
- due to the fast CPU used here, performance figures are somewhat
- higher than can be expected for a 80286/287 combination, except
- for the PEAKFLOP benchmark, which is basically coprocessor limited
- ** 8086/8087 system with 640 kB RAM
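-
- The 40 MHz and 33.3 MHz tables above can be cross-checked against
- each other: if a benchmark is limited by the CPU/coprocessor rather
- than by memory, its result should scale roughly with the clock
- frequency. The following short Python sketch computes the ratios for
- the Intel 387DX rows. The variable and function names are chosen here
- for illustration; only the numbers come from the tables.
-
```python
# Intel 387DX rows copied from the 40 MHz and 33.3 MHz tables above.
BENCHMARKS = ["PEAKFLOP", "TRNSFORM", "LLL", "Linpack", "Whetstone", "Savage"]
I387DX_40 = [0.9204, 0.7212, 0.3932, 0.3211, 2428, 52677]
I387DX_33 = [0.7647, 0.6004, 0.3283, 0.2676, 2046, 43860]

CLOCK_RATIO = 40.0 / 33.3   # about 1.20

def scaling_ratios(fast, slow):
    """Element-wise ratio of two rows of benchmark results."""
    return [f / s for f, s in zip(fast, slow)]

ratios = scaling_ratios(I387DX_40, I387DX_33)
for name, ratio in zip(BENCHMARKS, ratios):
    print(f"{name:10s} ratio {ratio:.4f} (clock ratio {CLOCK_RATIO:.4f})")
```
-
- For this row all six ratios lie within about 1.5% of the clock ratio.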
-
-
- Since neither a Weitek coprocessor nor a compiler that generates
- code for the Weitek chips was available, performance data for
- the Weitek Abacus are given here according to [31,32] and scaled to
- show performance of a 33 MHz system. The benchmarks were compiled
- using highly optimizing 32-bit compilers.
-
-                          Single Prec.      Double Prec.      Double Prec.
-
-                            3167     4167     3167     4167      387      486
-
- Linpack   MFLOPS            1.8      5.0      0.8      3.2      0.4      1.6
- Whetstone kWhet/sec        7470    22700     4900    14000     3290    12300
-
- Note that for the Intel coprocessors, running programs in single
- rather than double precision does not provide much of a performance
- advantage, since all internal calculations are always done in
- extended precision. With Weitek coprocessors, however, performance
- nearly doubles when switching from double to single precision. For
- double precision calculations using only basic arithmetic, the Weitek
- Abacus can provide at most twice the performance of the respective
- Intel coprocessor (387/486) clocked at the same speed.
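-
- The ratios behind these statements can be recomputed from the table
- above. The following Python sketch does just that. The dictionary
- layout and function names are made up for this illustration; the
- figures are the ones quoted from [31,32].
-
```python
# Figures copied from the table above (scaled to 33 MHz, per [31,32]).
LINPACK_MFLOPS = {
    ("3167", "single"): 1.8, ("4167", "single"): 5.0,
    ("3167", "double"): 0.8, ("4167", "double"): 3.2,
    ("387", "double"): 0.4, ("486", "double"): 1.6,
}
WHETSTONE_KWHET = {
    ("3167", "single"): 7470, ("4167", "single"): 22700,
    ("3167", "double"): 4900, ("4167", "double"): 14000,
    ("387", "double"): 3290, ("486", "double"): 12300,
}
# Each Weitek chip is compared with its respective Intel part.
PAIRS = [("3167", "387"), ("4167", "486")]

def single_vs_double(table, chip):
    """Speed-up from switching a Weitek chip from double to single."""
    return table[(chip, "single")] / table[(chip, "double")]

def weitek_vs_intel(table, weitek, intel):
    """Double precision advantage of the Weitek chip over the Intel part."""
    return table[(weitek, "double")] / table[(intel, "double")]

for name, table in (("Linpack", LINPACK_MFLOPS),
                    ("Whetstone", WHETSTONE_KWHET)):
    for weitek, intel in PAIRS:
        print(f"{name:9s} {weitek}: single/double = "
              f"{single_vs_double(table, weitek):.2f}, "
              f"vs {intel} (double) = "
              f"{weitek_vs_intel(table, weitek, intel):.2f}")
```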
-
-
- Speed of various coprocessor instructions in clock cycles, as
- measured with my program 87TIMES. Error is +/- one clock cycle,
- except for the Intel 80287. Times for the 80287 were determined on
- a system with a 20 MHz 80386 and a 5 MHz Intel 80287. Therefore,
- times may differ from a genuine 80286/287 system, especially for
- those instructions that access an operand in memory. Since the
- times are stated as the number of coprocessor clock cycles used,
- the faster 386, which can execute four clock cycles where the 80287
- executes one clock cycle, may decrease memory access times as seen
- by the coprocessor.
-
-
- Intel Intel Cyrix Cyrix ULSI IIT Intel Intel
- i486 RapidCAD 387+ 83D87 83C87 3C87 387DX 80387
-
- FLD1 | 5 7 17 17 17 22 27 35
- FLDZ | 5 7 17 17 17 22 22 29
- FLDPI | 8 9 17 17 17 22 37 45
- FLDLG2 | 8 9 17 17 17 22 37 44
- FLDL2T | 8 9 17 17 17 22 37 44
- FLDL2E | 8 9 17 17 17 22 37 44
- FLDLN2 | 8 9 17 17 17 22 37 45
- FLD ST(0) | 5 7 17 17 17 22 17 24
- FST ST(1) | 4 7 17 17 17 17 17 24
- FSTP ST(0) | 5 7 17 17 17 18 23 25
- FSTP ST(1) | 5 7 17 17 17 17 23 25
- FLD ST(1) | 5 7 17 17 17 22 17 25
- FXCH ST(1) | 5 7 17 17 17 22 22 25
- FILD [Word] | 13 16 35 36 41 46 46 65
- FILD [DWord] | 12 17 30 30 37 37 40 51
- FILD [QWord] | 13 20 40 40 47 47 45 66
- FLD [DWord] | 7 13 30 36 32 37 25 35
- FLD [QWord] | 7 15 40 44 42 47 35 45
- FLD [TByte] | 10 19 52 52 52 57 57 61
- FBLD [TByte] | 83 91 84 66 145 205 70 278
- FIST [Word] | 32 34 43 42 45 54 72 92
- FIST [DWord] | 33 35 48 44 48 57 74 91
- FST [DWord] | 11 14 44 42 49 41 46 47
- FST [QWord] | 16 18 56 54 60 53 58 60
- FISTP [Word] | 32 35 43 42 45 49 73 93
- FISTP [DWord] | 34 37 48 44 48 52 75 88
- FISTP [QWord] | 35 37 57 53 61 63 86 96
- FSTP [DWord] | 12 13 44 42 48 37 46 42
- FSTP [QWord] | 16 17 56 55 60 50 59 57
- FSTP [TByte] | 14 16 59 58 58 56 67 70
- FBSTP [TByte] | 171 175 101 98 126 216 147 535
- FINIT | 18 35 18 18 18 18 19 25
- FCLEX | 8 24 18 18 18 18 19 25
- FCHS | 8 11 17 17 17 17 31 35
- FABS | 6 8 17 17 17 17 28 31
- FXAM | 13 15 17 17 17 17 37 40
- FTST | 5 7 22 17 22 22 32 35
- FSTENV | 68 85 127 127 135 127 162 169
- FLDENV | 45 62 109 109 123 109 122 132
- FSAVE | 160 172 359 359 366 377 467 504
- FRSTOR | 131 206 361 361 369 367 424 453
- FSTSW [mem] | 4 7 16 16 17 16 17 22
- FSTSW AX | 4 7 14 14 14 14 14 17
- FSTCW [mem] | 4 7 16 16 16 16 16 22
- FLDCW [mem] | 5 14 28 28 29 29 29 34
- FADD ST,ST(0) | 8 9 22 17 17 22 27 30
- FADD ST,ST(1) | 9 10 22 17 17 22 22 34
- FADD ST(1),ST | 10 10 22 17 17 22 23 35
- FADDP ST(1),ST | 11 11 22 17 17 22 23 34
- FADD [DWord] | 9 14 30 30 33 32 31 42
- FADD [QWord] | 9 16 40 40 43 42 41 51
- FIADD [Word] | 20 21 36 36 43 43 49 77
- FIADD [DWord] | 20 25 30 30 38 38 43 65
- FSUB ST(1),ST | 10 10 22 17 17 22 23 35
- FSUBR ST(1),ST | 9 10 22 17 20 25 27 35
- FSUBRP ST(1),ST | 10 10 22 17 17 22 23 35
- FSUB [DWord] | 11 14 30 30 32 32 30 41
- FSUB [QWord] | 11 16 40 40 42 43 40 51
- FISUB [Word] | 21 21 36 36 44 43 56 77
- FISUB [DWord] | 21 25 30 30 39 38 43 65
- FMUL ST,ST(1) | 16 17 22 22 22 27 38 56
- FMUL ST(1),ST | 16 17 22 22 22 27 40 60
- FMULP ST(1),ST | 16 17 22 22 22 27 38 59
- FIMUL [Word] | 22 23 36 36 50 43 50 77
- FIMUL [DWord] | 22 25 36 36 45 38 46 73
- FMUL [DWord] | 11 14 36 36 32 38 31 48
- FMUL [QWord] | 14 16 46 46 42 48 41 72
- FDIV ST,ST(0) | 73 74 38 23 52 57 92 95
- FDIV ST,ST(1) | 73 74 42 36 52 57 78 95
- FDIV ST(1),ST | 73 74 42 36 52 57 78 99
- FDIVR ST(1),ST | 73 74 42 36 53 57 77 100
- FDIVRP ST(1),ST | 73 74 42 36 52 57 78 101
- FIDIV [Word] | 84 85 61 54 79 73 105 144
- FIDIV [DWord] | 84 85 54 47 74 68 101 129
- FDIV [DWord] | 73 74 54 48 63 62 78 100
- FDIV [QWord] | 73 74 64 57 72 72 79 113
- FSQRT (0.0) | 26 28 17 17 17 22 27 35
- FSQRT (1.0) | 83 84 72 36 87 57 112 128
- FSQRT (L2T) | 86 87 72 36 87 57 102 133
- FXTRACT (L2T) | 17 17 22 17 32 76 56 68
- FSCALE (PI,5) | 30 31 22 36 47 77 57 80
- FRNDINT (PI) | 31 31 27 19 32 27 47 74
- FPREM (99,PI) | 58 60 102 52 57 52 77 100
- FPREM1(99,PI) | 90 91 102 57 62 52 102 119
- FCOM | 5 7 17 17 27 17 27 34
- FCOMP | 6 7 17 17 27 17 28 35
- FCOMPP | 7 8 17 17 27 22 28 34
- FICOM [Word] | 16 20 36 36 49 37 61 77
- FICOM [DWord] | 18 25 30 30 44 32 48 61
- FCOM [DWord] | 7 14 30 30 33 32 31 35
- FCOM [QWord] | 7 15 40 40 43 42 41 51
- FSIN (0.0) | 25 27 97 17 17 22 37 45
- FSIN (1.0) | 310 314 162 116 492 222 512 593
- FSIN (PI) | 88 90 187 121 67 217 132 155
- FSIN (LG2) | 284 288 84 73 445 184 434 505
- FSIN (L2T) | 299 303 177 121 472 217 452 533
- FCOS (0.0) | 25 27 157 17 22 22 37 44
- FCOS (1.0) | 302 306 107 87 487 212 457 540
- FCOS (PI) | 89 92 257 151 62 222 197 230
- FCOS (LG2) | 300 304 152 106 452 192 502 584
- FCOS (L2T) | 307 311 242 156 467 222 507 598
- FSINCOS (0.0) | 26 29 17 17 22 31 41 54
- FSINCOS (1.0) | 353 357 172 126 492 416 536 637
- FSINCOS (PI) | 105 107 262 161 67 421 226 273
- FSINCOS (LG2) | 340 344 157 116 457 361 531 628
- FSINCOS (L2T) | 347 351 247 166 472 421 536 643
- FPTAN (0.0) | 26 28 17 17 22 31 36 43
- FPTAN (1.0) | 267 269 147 121 537 306 322 392
- FPTAN (PI) | 145 146 227 136 112 306 167 212
- FPTAN (LG2) | 244 246 132 91 502 276 297 363
- FPTAN (L2T) | 247 249 217 136 517 306 297 363
- FPATAN (0.0) | 39 41 27 22 22 27 97 92
- FPATAN (1.0) | 294 298 157 121 372 602 358 433
- FPATAN (PI) | 304 307 192 143 357 422 378 468
- FPATAN (LG2) | 289 293 157 126 362 382 373 447
- FPATAN (L2T) | 304 307 192 141 362 422 373 463
- F2XM1 (0.0) | 26 28 17 17 17 22 37 38
- F2XM1 (LN2) | 209 212 122 86 392 287 297 348
- F2XM1 (LG2) | 204 207 107 76 377 287 292 340
- FYL2X (1.0) | 60 60 42 36 72 92 112 127
- FYL2X (PI) | 294 297 162 111 452 357 393 497
- FYL2X (LG2) | 311 314 162 106 457 337 408 512
- FYL2X (L2T) | 293 296 162 111 437 357 393 496
- FYL2XP1 (LG2) | 334 337 167 101 462 282 433 533
-
-
-
- 80386 + 80386 + 80386 +
- Intel Intel Franke387 TP 6.0 EM87
- 8087 80287 Emulator Emulator Emulator
-
- FSTP ST(0) | 26 54 507 358 2115
- FLD1 | 26 55 481 422 1626
- FLDZ | 21 53 480 416 1646
- FLDPI | 26 55 486 443 1626
- FLDLG2 | 26 56 486 423 1626
- FLDL2T | 26 55 486 440 1626
- FLDL2E | 26 53 486 423 1626
- FLDLN2 | 26 55 486 441 1626
- FLD ST(0) | 31 55 493 362 1851
- FST ST(1) | 26 54 489 355 1931
- FSTP ST(1) | 21 55 507 356 2116
- FLD ST(1) | 26 55 493 362 1852
- FXCH ST(1) | 21 57 497 486 2187
- FILD [Word] | 58 90 667 712 2259
- FILD [DWord] | 64 74 608 812 2164
- FILD [QWord] | 74 93 652 707 2971
- FLD [DWord] | 49 44 633 473 2077
- FLD [QWord] | 54 57 641 524 2336
- FLD [TByte] | 59 45 607 492 2063
- FBLD [TByte] | 309 310 2019 1512 17827
- FIST [Word] | 79 72 854 766 2418
- FIST [DWord] | 84 80 865 518 2325
- FST [DWord] | 89 85 686 441 2200
- FST [QWord] | 99 92 703 516 2481
- FISTP [Word] | 79 80 864 794 2620
- FISTP [DWord] | 79 81 879 541 2523
- FISTP [QWord] | 88 75 904 916 3226
- FSTP [DWord] | 89 75 713 467 2400
- FSTP [QWord] | 93 72 732 538 2678
- FSTP [TByte] | 49 21 685 467 2124
- FBSTP [TByte] | 528 472 3305 1555 27013
- FINIT | 11 10 742 641 1369
- FCLEX | 11 10 440 323 912
- FCHS | 21 54 460 354 1744
- FABS | 21 54 456 349 1738
- FXAM | 21 54 481 380 1551
- FTST | 51 75 585 386 2721
- FSTENV | 54 57 928 519 2104
- FLDENV | 48 50 1125 450 1631
- FSAVE | 214 244 1949 976 2749
- FRSTOR | 209 227 2182 657 2225
- FSTSW [mem] | 28 10 516 401 1189
- FSTSW AX | N/A 55 451 N/A N/A
- FSTCW [mem] | 28 10 506 359 1167
- FLDCW [mem] | 19 47 524 437 1584
- FADD ST,ST(0) | 86 128 643 706 2805
- FADD ST,ST(1) | 85 116 707 808 3093
- FADD ST(1),ST | 92 131 664 812 3146
- FADDP ST(1),ST | 92 129 704 799 3143
- FADD [DWord] | 105 122 874 969 3139
- FADD [QWord] | 115 122 888 1021 3396
- FIADD [Word] | 115 122 940 1211 3330
- FIADD [DWord] | 125 122 882 1297 3215
- FSUB ST(1),ST | 88 130 738 817 3156
- FSUBR ST(1),ST | 96 132 740 868 3004
- FSUBRP ST(1),ST | 99 132 733 805 3301
- FSUB [DWord] | 119 122 918 1018 3127
- FSUB [QWord] | 129 123 932 1070 3632
- FISUB [Word] | 115 123 977 1081 3802
- FISUB [DWord] | 125 125 940 980 4161
- FMUL ST,ST(1) | 145 151 810 1368 3924
- FMUL ST(1),ST | 145 151 817 1377 3962
- FMULP ST(1),ST | 148 168 840 1365 4164
- FIMUL [Word] | 132 151 1039 1517 4039
- FIMUL [DWord] | 141 151 980 1643 3976
- FMUL [DWord] | 125 123 948 1480 3445
- FMUL [QWord] | 175 192 991 1602 4416
- FDIV ST,ST(0) | 201 207 726 1536 9789
- FDIV ST,ST(1) | 203 218 808 1658 10332
- FDIV ST(1),ST | 207 214 825 1655 10342
- FDIVR ST(1),ST | 201 206 819 1806 10213
- FDIVRP ST(1),ST | 201 205 845 1803 10409
- FIDIV [Word] | 237 227 980 1779 11225
- FIDIV [DWord] | 246 227 944 1680 11572
- FDIV [DWord] | 229 226 893 1722 10577
- FDIV [QWord] | 236 227 993 1777 10829
- FSQRT (0.0) | 21 57 512 382 1755
- FSQRT (1.0) | 186 206 1106 2504 37836
- FSQRT (L2T) | 186 207 1398 2467 37925
- FXTRACT (L2T) | 51 56 726 571 3326
- FSCALE (PI,5) | 41 56 817 443 3194
- FRNDINT (PI) | 51 58 808 800 7092
- FPREM (99,PI) | 81 131 1696 941 4098
- FPREM1(99,PI) | N/A N/A 1625 N/A N/A
- FCOM | 56 75 582 483 2799
- FCOMP | 61 92 616 485 2983
- FCOMPP | 61 90 661 476 3198
- FICOM [Word] | 79 77 808 861 3654
- FICOM [DWord] | 89 77 750 964 3684
- FCOM [DWord] | 74 75 741 625 3643
- FCOM [QWord] | 74 76 754 667 3771
- FSIN (0.0) | N/A N/A 639 N/A N/A
- FSIN (1.0) | N/A N/A 4640 N/A N/A
- FSIN (PI) | N/A N/A 2488 N/A N/A
- FSIN (LG2) | N/A N/A 3911 N/A N/A
- FSIN (L2T) | N/A N/A 3767 N/A N/A
- FCOS (0.0) | N/A N/A 740 N/A N/A
- FCOS (1.0) | N/A N/A 4777 N/A N/A
- FCOS (PI) | N/A N/A 2557 N/A N/A
- FCOS (LG2) | N/A N/A 4176 N/A N/A
- FCOS (L2T) | N/A N/A 3905 N/A N/A
- FSINCOS (0.0) | N/A N/A 714 N/A N/A
- FSINCOS (1.0) | N/A N/A 6049 N/A N/A
- FSINCOS (PI) | N/A N/A 4091 N/A N/A
- FSINCOS (LG2) | N/A N/A 5640 N/A N/A
- FSINCOS (L2T) | N/A N/A 5405 N/A N/A
- FPTAN (0.0) | 41 58 752 8381 2324
- FPTAN (1.0) | 581 582 6366 10817 29824
- FPTAN (PI) | 606 587 4388 12410 2300
- FPTAN (LG2) | 516 513 5939 12502 26770
- FPTAN (L2T) | 576 586 5723 12483 2301
- FPATAN (0.0) | 41 55 616 1208 10578
- FPATAN (1.0) | 736 736 1426 13446 34208
- FPATAN (PI) | 206 207 12835 13305 46903
- FPATAN (LG2) | 756 736 12490 13319 41312
- FPATAN (L2T) | 206 204 12922 13364 50149
- F2XM1 (0.0) | 16 56 563 723 1722
- F2XM1 (LN2) | 631 624 4178 11070 33823
- F2XM1 (LG2) | 611 585 4798 11116 32163
- FYL2X (1.0) | 56 57 961 1214 4327
- FYL2X (PI) | 946 961 8987 12858 40148
- FYL2X (LG2) | 1081 1038 8933 12748 46821
- FYL2X (L2T) | 926 886 8982 12712 38986
- FYL2XP1 (LG2) | 1026 1037 10485 11867 44708
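-
- Clock cycle counts translate into execution times once a clock
- frequency is fixed: time in microseconds = cycles / MHz. The Python
- sketch below applies this to the FSIN (1.0) row of the first table.
- The 33.3 MHz clock is an assumption made for the example, and the
- function name is chosen here for illustration.
-
```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Execution time in microseconds for a given cycle count."""
    return cycles / clock_mhz

# FSIN (1.0) cycle counts copied from the first table above.
FSIN_1 = {
    "Intel i486": 310, "Intel RapidCAD": 314,
    "Cyrix 387+": 162, "Cyrix 83D87": 116,
    "ULSI 83C87": 492, "IIT 3C87": 222,
    "Intel 387DX": 512, "Intel 80387": 593,
}

CLOCK_MHZ = 33.3                    # assumed clock for this example
reference = FSIN_1["Intel 387DX"]   # baseline for the speed-up column

for chip, cycles in FSIN_1.items():
    print(f"{chip:15s} {cycles:4d} cycles = "
          f"{cycles_to_microseconds(cycles, CLOCK_MHZ):6.2f} us, "
          f"{reference / cycles:4.2f} x Intel 387DX")
```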
-
- The Weitek 3167 and 4167 processors only implement the basic
- arithmetic functions (add, subtract, multiply, divide, square
- root) in hardware. Transcendental functions are implemented
- by means of a software library supplied by Weitek that uses
- the Weitek hardware to approximate the transcendental functions
- with polynomial and rational approximations. The clock cycle
- timings for the transcendental functions are average values,
- since execution time differs with the value of argument. The
- speed of transcendental functions for the 4167 is estimated
- based on the numbers in [31,33], from which this timing
- information has been extracted.
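-
- To give an idea of how a transcendental function can be built from
- multiplications and additions alone, here is a minimal Python sketch
- that approximates sin(x) on [-pi/4, pi/4] with a polynomial evaluated
- by Horner's scheme. This is not Weitek's library code: the Taylor
- coefficients are used only because they are easy to state, and the
- names poly_sin() and max_error() are made up for this example.
-
```python
import math

# Taylor coefficients of sin(x) for x, x^3, ..., x^11 (illustrative only;
# a production library would typically use minimax coefficients instead).
SIN_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(6)]

def poly_sin(x):
    """Approximate sin(x) for |x| <= pi/4 using only multiply and add."""
    x2 = x * x
    acc = 0.0
    for c in reversed(SIN_COEFFS):   # Horner's scheme in x^2
        acc = acc * x2 + c
    return acc * x

def max_error(samples=1001):
    """Largest absolute deviation from math.sin on [-pi/4, pi/4]."""
    worst = 0.0
    for i in range(samples):
        x = -math.pi / 4 + i * (math.pi / 2) / (samples - 1)
        worst = max(worst, abs(poly_sin(x) - math.sin(x)))
    return worst

if __name__ == "__main__":
    print(f"max abs error on [-pi/4, pi/4]: {max_error():.3e}")
```
-
- Arguments outside the interval would first have to be reduced to it,
- which is one reason why execution time varies with the argument.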
-
-
- Execution time for floating-point operations in clock cycles on
- Weitek coprocessors
-
-              Single Precision  Double Precision
-
-                  3167     4167     3167     4167
-
- ABS                 3        2        3        2
- NEG                 6        2        6        2
- ADD                 6        2        6        2
- SUB                 6        2        6        2
- SUBR                6        2        6        2
- MUL                 6        2       10        3
- DIVR               38       17       66       31
- SQRT               60       17      118       31
- SIN               146      ~50      292     ~100
- COS               140      ~50      285     ~100
- TAN               188      ~60      340     ~110
- EXP               179      ~60      401     ~130
- LOG               171      ~60      365     ~120
- F->ASCII         1000      N/A     1700      N/A  //
- ASCII->F         1100      N/A     1800      N/A  //
-
- // rough average of the timings given for different numeric
- formats by Weitek. Note that these conversion routines
- do much more work than the FBLD and FBSTP instructions
- provided by the 80x87 coprocessors. FBLD and FBSTP are
- useful for conversion routines but quite a bit of additional
- code is needed for this purpose.
-
-
- Accuracy
-
- The IEEE-754 Standard for Binary Floating-Point Arithmetic [10,11]
- is fully implemented by Intel's 387 coprocessor [17]. Among other
- things, this means that the add, subtract, multiply, divide,
- remainder, and square root operations always deliver the 'exact'
- result. By exact it is meant that the coprocessor always delivers
- the machine number closest to the real result, which may not
- be representable exactly in the available numeric format. The
- 80387 implements the single, double, and double extended formats
- as specified in the standard as well as all functions required
- by it [17]. Note that earlier Intel coprocessors (the 8087 and
- the 80287) comply with a draft version of the standard that differs
- from the final version. These chips came out before the IEEE-754
- standard was finally accepted in 1985. As in the 80387, the basic
- arithmetic in the 8087 and the 80287 is exact in the sense that
- the computed result is always the machine number closest to the
- real result. However, there are some differences regarding certain
- operands like infinities, and some operations like the remainder are
- defined differently. Some instructions have been added in the 80387,
- most notably the FSIN and FCOS operations. The argument range for
- some transcendental functions has been extended [17]. Note that the
- IEEE-754 standard says nothing about the quality of the implementation
- of transcendental functions like sin, cos, tan, arctan, log. Intel
- uses a modified CORDIC [18,19] technique to compute the transcendental
- functions. Intel claims that the maximum error in the 8087, 80287, and
- 80387 for all transcendental functions does not exceed two bits
- in the mantissa of the double extended format, which features 64
- mantissa bits for an accuracy of approximately 19 decimal places
- [22,23]. This claim has been independently verified by a competing
- vendor [13]. This means that at least 62 of the 64 mantissa bits
- in a transcendental function result are correct.
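-
- The notion of an 'exact' (correctly rounded) result can be checked
- with exact rational arithmetic. The following Python sketch does this
- for division in the double format, since Python floats are IEEE
- doubles on common platforms; it does not exercise the double extended
- format of the coprocessor. It also converts mantissa bits to decimal
- digits. The function names are invented for this example, and
- math.nextafter requires Python 3.9 or later.
-
```python
from fractions import Fraction
import math

def is_correctly_rounded(computed, exact):
    """True if no neighbouring double is closer to `exact` than `computed`."""
    err = abs(Fraction(computed) - exact)
    below = math.nextafter(computed, -math.inf)
    above = math.nextafter(computed, math.inf)
    return (err <= abs(Fraction(below) - exact)
            and err <= abs(Fraction(above) - exact))

def check_division(a, b):
    """Compare the machine quotient a / b with the exact rational quotient."""
    return is_correctly_rounded(a / b, Fraction(a) / Fraction(b))

def decimal_digits(mantissa_bits):
    """Approximate number of decimal digits carried by a binary mantissa."""
    return mantissa_bits * math.log10(2)

if __name__ == "__main__":
    for a, b in [(1.0, 3.0), (2.0, 7.0), (10.0, 9.0)]:
        print(f"{a} / {b} correctly rounded: {check_division(a, b)}")
    print(f"64 bits ~ {decimal_digits(64):.2f} decimal digits")
    print(f"62 bits ~ {decimal_digits(62):.2f} decimal digits")
```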
-
- The Weitek Abacus 3167 and 4167 are 'mostly compatible' with
- IEEE-754 [31,32,33]. They support the single precision and double
- precision numeric formats described in the standard as
- well as the four rounding modes required by it. However, due to
- the need for extremely high speed operation, some of the finer
- points of IEEE-754 have not been implemented. One of the most
- notable omissions is the missing support for denormal numbers.
- Denormals are always flushed to zero.
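-
- The difference between gradual underflow (denormals) and flushing to
- zero can be shown with a few lines of Python, whose floats are IEEE
- doubles with denormal support on common platforms. The function
- flush_to_zero() is a made-up model of the behaviour described above,
- not Weitek code, and it does not model the sign of the flushed zero.
-
```python
import sys

SMALLEST_NORMAL = sys.float_info.min      # 2**-1022 for IEEE double

def flush_to_zero(x):
    """Emulate a unit without denormal support: tiny results become 0."""
    if x != 0.0 and abs(x) < SMALLEST_NORMAL:
        return 0.0                        # sign of zero is not modelled
    return x

gradual = SMALLEST_NORMAL / 4             # a denormal with IEEE arithmetic
flushed = flush_to_zero(gradual)

print(f"smallest normal double: {SMALLEST_NORMAL!r}")
print(f"with denormals:         {gradual!r}")
print(f"flushed to zero:        {flushed!r}")
```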
-
- The 387 clone makers claim 100% compatibility with Intel's 80387.
- So one would expect the same accuracy from their chips. For example,
- on the packaging of the IIT 3C87 it says that ".. the requirements
- of ANSI/IEEE standards are fulfilled and exceeded". Cyrix states
- that their 83D87 complies fully with the IEEE-754 standard [12].
- Cyrix ships some diagnostic software with their coprocessors.
- This includes the program IEEETEST which is based on the IEEE test
- vectors from the Ph.D. thesis of Jerome T. Coonen [9]. A test using
- the IEEE test vectors has also been included into the RUNDIAG
- program on the Intel RapidCAD diagnostic disk. Rather than performing
- random tests, the test vectors check specific cases that may
- be hard to get right. Each test vector specifies the operation
- to be performed, the operands, precision and rounding mode to be
- used, and the result (including flags set) to be expected according
- to IEEE-754. I ran IEEETEST on all the available coprocessors/FPUs.
- The Intel 486, Intel RapidCAD, Intel 387, Intel 387DX, Cyrix 83D87,
- and the Cyrix 387+ passed with no errors. The ULSI 83C87 showed
- some minor flaws in the FCOM, FDIV, FMUL, and FSCALE operations,
- getting flag errors in about 1% of the tested cases, but no
- computational errors. However, for the IIT 3C87, the IEEETEST
- program showed flag *and* some computational errors (that is, wrong
- results) for all tested operations except FXTRACT and FCHS. The Intel
- 80287 shows numerous errors, but this is not surprising, since the
- 80287 does not comply with IEEE-754 but with an earlier draft of that
- standard, so it does some things differently than required by the
- final version of the standard.
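-
- The principle of a test-vector driven check can be sketched in a few
- lines of Python. This is not IEEETEST and does not read the TESTVECS
- format: the tuple layout and the vectors below are hypothetical,
- only double precision with round-to-nearest is covered, and exception
- flags are not checked because Python does not expose them. The last
- vector carries a deliberately wrong expected value to show how a
- failure is reported.
-
```python
import operator

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

# Hypothetical vectors: (operation, operand 1, operand 2, expected result).
VECTORS = [
    ("+", 1.0, 2.0, 3.0),
    ("/", 1.0, 3.0, float.fromhex("0x1.5555555555555p-2")),
    ("*", 0.1, 3.0, 0.30000000000000004),
    ("+", 0.1, 0.2, 0.3),     # deliberately wrong expectation -> 'F'
]

def run_vectors(vectors):
    """Return (marks, passed, failed); '.' marks a pass, 'F' a failure."""
    marks = []
    for op, x, y, expected in vectors:
        computed = OPS[op](x, y)
        # Compare hex representations so that +0.0 and -0.0 differ.
        marks.append("." if computed.hex() == expected.hex() else "F")
    text = "".join(marks)
    return text, text.count("."), text.count("F")

if __name__ == "__main__":
    marks, passed, failed = run_vectors(VECTORS)
    print(marks, f"passed={passed} failed={failed}")
```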
-
- Although IEEETEST is written in Turbo Pascal, the coprocessor
- emulator in the TP 6.0 library could not be tested since IEEETEST
- was compiled with the $E- switch excluding the emulator from
- program code. The public domain emulator EM87 could be tested, but
- hung in the last test which checks the implementation of the
- remainder operation. This is probably caused by some bug in the
- emulation of the FPREM instruction tested in this test. It is
- interesting to note how the error profile of EM87 matches exactly
- that of the Intel 80287, so it can be assumed that EM87 is a very
- good emulation of the 80287. The Franke387 V2.4 emulator hung in
- the division test quite early in IEEETEST. The tests performed
- up to the division test reported several errors.
-
-
-
- Explanatory text printed at the start of the IEEETEST program:
-
- JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his
- activities as a member of the floating-point working group that
- defined the IEEE 754-1985 Standard for Binary Floating-Point
- Arithmetic. Appendix C of his thesis presents FPTEST, a Pascal
- program written by J Thomas and JT Coonen. IEEETEST is a port of
- FPTEST and runs on PCs whose math coprocessor accepts 80387
- compatible floating-point instructions.
-
- IEEETEST reads test vectors from the file TESTVECS and compares
- the answer returned by the math coprocessor with the answer listed
- in the test vector. If these answers differ an 'F' is displayed,
- otherwise a '.'is displayed. Answers can differ due to two types
- of failures: numeric failures or flag failures. Numeric failures
- occur when the computed answer has the wrong value. Flag failures
- occur when the status (invalid operation, divide by zero, underflow,
- overflow, inexact) is incorrectly identified.
-
- TESTVECS is the concatenation of unmodified versions of all the
- test vectors distributed by UC Berkeley. The test data base is
- copyrighted by UC Berkeley (1985) and is being distributed with
- their permission. FPTEST and the test data base can be obtained
- by asking for 'IEEE-754 Test Vector' from UC Berkeley, Electrical
- Engineering and Computer Science, Industrial Liaison Program,
- 479 Corey Hall, Berkeley, CA, 94720 (415)643-6687.
-
- The initial version of this test data base for the proposed IEEE
- 754 binary floating-point standard (draft 8.0) was developed for
- Zilog, Inc. and was donated to the floating-point working group
- for dissemination. Errors in or additions to the distributed data
- base should be reported to the agency of distribution, with copies
- to Zilog, Inc., 1315 Dell Avenue, Campbell, CA, 95008.
-
-
- IEEETEST output for Intel 80387, Intel 387DX, Intel 486,
- Cyrix 83D87, Cyrix 387+, RapidCAD
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 3528 0 | 0 0 0 | 0 0 0
- Comparison C | 4320 0 | 0 0 0 | 0 0 0
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 4311 0 | 0 0 0 | 0 0 0
- Fraction Part F | 624 0 | 0 0 0 | 0 0 0
- Logb L | 960 0 | 0 0 0 | 0 0 0
- Multiplication * | 3978 0 | 0 0 0 | 0 0 0
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 2832 0 | 0 0 0 | 0 0 0
- Round to Integer I | 558 0 | 0 0 0 | 0 0 0
- Scalb S | 948 0 | 0 0 0 | 0 0 0
- Square Root V | 744 0 | 0 0 0 | 0 0 0
- Subtraction - | 3528 0 | 0 0 0 | 0 0 0
- Remainder % | 2984 0 | 0 0 0 | 0 0 0
- Totals | 31235 0 |
-
-
- IEEETEST output for ULSI 83C87
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 3528 0 | 0 0 0 | 0 0 0
- Comparison C | 4312 8 | 0 0 0 | 0 0 8
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 4250 61 | 0 0 0 | 28 28 5
- Fraction Part F | 624 0 | 0 0 0 | 0 0 0
- Logb L | 960 0 | 0 0 0 | 0 0 0
- Multiplication * | 3936 42 | 0 0 0 | 19 19 4
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 2828 4 | 0 0 0 | 0 0 4
- Round to Integer I | 558 0 | 0 0 0 | 0 0 0
- Scalb S | 930 18 | 0 0 0 | 6 6 6
- Square Root V | 744 0 | 0 0 0 | 0 0 0
- Subtraction - | 3528 0 | 0 0 0 | 0 0 0
- Remainder % | 2984 0 | 0 0 0 | 0 0 0
- Totals | 31102 133 |
-
-
- IEEETEST output for IIT 3C87
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 200 16 | 0 0 16 | 0 0 0
- Addition + | 3336 192 | 0 0 128 | 0 0 96
- Comparison C | 4224 96 | 0 0 96 | 0 0 0
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 4159 152 | 0 0 124 | 0 0 116
- Fraction Part F | 600 24 | 0 0 24 | 0 0 24
- Logb L | 960 0 | 0 0 0 | 0 0 0
- Multiplication * | 3702 276 | 0 0 248 | 0 0 100
- Negation - | 200 16 | 0 0 16 | 0 0 0
- Next After N | 2248 584 | 0 0 584 | 0 0 168
- Round to Integer I | 542 16 | 0 0 4 | 0 0 16
- Scalb S | 874 74 | 5 5 44 | 8 8 20
- Square Root V | 688 56 | 0 0 56 | 0 0 56
- Subtraction - | 3336 192 | 0 0 128 | 0 0 96
- Remainder % | 2844 140 | 0 0 140 | 0 0 116
- Totals | 29401 1834 |
-
-
- IEEETEST output for Intel 80287 run together with a 80386 CPU
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 2886 642 | 16 16 112 | 174 174 174
- Comparison C | 0 4320 | 1324 1324 1324 |1332 1332 1332
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 3777 534 | 18 18 37 | 169 169 165
- Fraction Part F | 552 72 | 24 24 24 | 24 24 24
- Logb L | 900 60 | 12 12 12 | 20 20 20
- Multiplication * | 2944 1034 | 105 105 197 | 303 303 231
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 348 2484 | 768 768 768 | 504 504 526
- Round to Integer I | 546 12 | 0 0 0 | 4 4 4
- Scalb S | 663 285 | 45 43 26 | 102 98 46
- Square Root V | 720 24 | 4 4 4 | 8 8 8
- Subtraction - | 2886 642 | 16 16 112 | 174 174 174
- Remainder % | 708 2276 | 768 768 560 | 216 216 216
- Totals | 18850 12385 |
-
-
- IEEETEST output for EM87 coprocessor emulator run on a Intel 386 CPU
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 2886 642 | 16 16 112 | 174 174 174
- Comparison C | 0 4320 | 1324 1324 1324 |1332 1332 1332
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 3777 534 | 18 18 37 | 169 169 165
- Fraction Part F | 552 72 | 24 24 24 | 24 24 24
- Logb L | 900 60 | 12 12 12 | 20 20 20
- Multiplication * | 2944 1034 | 105 105 197 | 303 303 231
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 348 2484 | 768 768 768 | 504 504 526
- Round to Integer I | 546 12 | 0 0 0 | 4 4 4
- Scalb S | 663 285 | 45 43 26 | 102 98 46
- Square Root V | 720 24 | 4 4 4 | 8 8 8
- Subtraction - | 2886 642 | 16 16 112 | 174 174 174
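-
- The totals in the tables above can be turned into failure rates with
- a few lines of Python. The numbers are copied from the tables; the
- names used in the code are chosen here for illustration. EM87 is left
- out because its run did not complete.
-
```python
# (passed, failed) totals copied from the IEEETEST tables above.
TOTALS = {
    "Intel 387/486, Cyrix, RapidCAD": (31235, 0),
    "ULSI 83C87": (31102, 133),
    "IIT 3C87": (29401, 1834),
    "Intel 80287": (18850, 12385),
}

# Per-operation failure counts for the ULSI 83C87, from its table.
ULSI_FAILED = {"Comparison": 8, "Division": 61, "Multiplication": 42,
               "Next After": 4, "Scalb": 18}

def failure_rate(passed, failed):
    """Failed test vectors as a percentage of all vectors run."""
    return 100.0 * failed / (passed + failed)

for chip, (passed, failed) in TOTALS.items():
    print(f"{chip:32s} {passed + failed:6d} vectors, "
          f"{failure_rate(passed, failed):6.2f}% failed")
```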
-
-
- To complement the checks done by IEEETEST I wrote some short
- programs DENORMTS, RCTRL, PCTRL in Turbo Pascal 6.0 that test
- the following features:
-
- 1. support for denormals in all precisions (single, double, extended)
- 2. support for the four IEEE rounding modes (up, down, nearest, chop)
- 3. support for precision control
-
- Note that 1) and 2) are required for IEEE conformance, while 3)
- is required for compatibility with Intel's coprocessors. Precision
- control forces the results of the FADD, FSUB, FMUL, FDIV, and FSQRT
- instructions to be rounded to the specified precision (single, double,
- double extended). This feature is provided to obtain compatibility
- with certain programming languages [17]. By specifying lower
- precision, one effectively nullifies the advantages of extended
- precision intermediate results. The programs that test precision
- control and rounding control are designed to return a different
- result for each of the modes for the same sequence of operations.
- The source code of the programs can be found in appendix A. The
- Intel 8087 and 80287 were not tested with DENORMTS since Turbo
- Pascal does not support extended precision denormals on 8087/80287
- processors, so the denormal test fails anyway. The 8087 and 287
- pass the RCTRL and PCTRL tests, though.
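-
- The effect of precision control and rounding control can be modelled
- with exact rational arithmetic. The Python sketch below rounds an
- exact value to a mantissa of 24, 53, or 64 bits in each of the four
- rounding modes. It is not a port of PCTRL or RCTRL from appendix A:
- the function round_to_bits() is invented for this example, and the
- exponent range (overflow, underflow, denormals) is ignored.
-
```python
from fractions import Fraction
import math

def round_to_bits(x, bits, mode="nearest"):
    """Round the exact value x to a significand of `bits` bits."""
    x = Fraction(x)
    if x == 0:
        return x
    ax = abs(x)
    # Find e with 2**e <= |x| < 2**(e + 1).
    e = ax.numerator.bit_length() - ax.denominator.bit_length()
    if Fraction(2) ** e > ax:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)   # spacing of representable values
    scaled = x / ulp
    if mode == "nearest":
        q = round(scaled)                 # ties go to the even neighbour
    elif mode == "down":
        q = math.floor(scaled)            # toward minus infinity
    elif mode == "up":
        q = math.ceil(scaled)             # toward plus infinity
    elif mode == "chop":
        q = math.trunc(scaled)            # toward zero
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return q * ulp

if __name__ == "__main__":
    third = Fraction(1, 3)
    for bits in (24, 53, 64):             # single, double, double extended
        for mode in ("nearest", "down", "up", "chop"):
            err = round_to_bits(third, bits, mode) - third
            print(f"{bits:2d} bits, {mode:7s}: error {float(err):+.3e}")
```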
-
-
- These are the results for the Intel 387, Intel 387DX, Intel 486,
- Intel RapidCAD, Cyrix 83D87, Cyrix 387+, and the EM87 emulator
- (on a 80386 machine)
-
- Precision Control SINGLE 1.13311278820037842E+0000
- DOUBLE 1.23456789006442125E+0000
- EXTENDED 1.23456789012337585E+0000
-
- Rounding Control NEAREST -1.23427629010100635E+0100
- DOWN -1.23427623555772409E+0100
- UP -1.23457760966801097E+0100
- CHOP -1.23397493540770643E+0100
-
- Denormal support
-
- SINGLE denormals supported
- SINGLE denormal prints as: 4.60943116855005E-0041
- Denormal should be printed as 4.60943...E-0041
-
- DOUBLE denormals supported
- DOUBLE denormal prints as: 8.75000000000016E-0311
- Denormal should be printed as 8.75...E-0311
-
- EXTENDED denormals supported
- EXTENDED denormal prints as: 1.31640625000000E-4934
- Denormal should be printed as 1.3164...E-4934
-
-
- These are the results for the ULSI 83C87
-
- Precision Control SINGLE 1.23456789012337585E+0000
- DOUBLE 1.23456789012337585E+0000
- EXTENDED 1.23456789012337585E+0000
-
- Rounding Control NEAREST -1.23427629010100635E+0100
- DOWN -1.23427623555772409E+0100
- UP -1.23457760966801097E+0100
- CHOP -1.23397493540770643E+0100
-
- Denormal support
-
- SINGLE denormals supported
- SINGLE denormal prints as: 4.60943116855005E-0041
- Denormal should be printed as 4.60943...E-0041
-
- DOUBLE denormals supported
- DOUBLE denormal prints as: 8.75000000000016E-0311
- Denormal should be printed as 8.75...E-0311
-
- EXTENDED denormals supported
- EXTENDED denormal prints as: 1.31640625000000E-4934
- Denormal should be printed as 1.3164...E-4934
-
-
- These are the results for the IIT 3C87
-
- Precision Control SINGLE 1.13311278820037842E+0000
- DOUBLE 1.23456789006442125E+0000
- EXTENDED 1.23456789012337585E+0000
-
- Rounding Control NEAREST -1.23427629010100635E+0100
- DOWN -1.23427623555772409E+0100
- UP -1.23457760966801097E+0100
- CHOP -1.23397493540770643E+0100
-
- Denormal support
-
- SINGLE denormals supported
- SINGLE denormal prints as: 4.60943116855005E-0041
- Denormal should be printed as 4.60943...E-0041
-
- DOUBLE denormals supported
- DOUBLE denormal prints as: 8.75000000000016E-0311
- Denormal should be printed as 8.75...E-0311
-
- EXTENDED denormals not supported
-
-
- These are the results for the TP 6.0 coprocessor emulator:
-
- Precision Control SINGLE 1.23456789012351396E+0000
- DOUBLE 1.23456789012351396E+0000
- EXTENDED 1.23456789012351396E+0000
-
- Rounding Control NEAREST -1.23457766383395931E+0100
- DOWN -1.23457766383395931E+0100
- UP -1.23457766383395931E+0100
- CHOP -1.23457766383395931E+0100
-
- Denormal support
-
- SINGLE denormals not supported
- DOUBLE denormals not supported
- EXTENDED denormals not supported
-
-
- The test results show that the IIT 3C87 does not conform to the
- IEEE-754 floating-point standard in that it does not support
- denormals in double extended precision. The ULSI 83C87 is not
- Intel 387 compatible in that it does not support precision control,
- but always uses double extended precision. The TP 6.0 emulator
- supports neither precision control nor rounding control, and it
- does not support denormals in any precision. In addition, its basic
- arithmetic operations do not seem to conform to the IEEE standard,
- as the results of the test programs differ from the results computed
- by any coprocessor in any mode.
-
-
- With regard to the accuracy of transcendental functions, Cyrix
- claims that the relative error of the transcendental functions
- on the 83D87 never exceeds 0.5 units in the last place (0.5 ULP)
- of the double extended format [13]. This means that the maximum
- relative error is below 2**-64, while Intel's published error
- limit is 2**-62. While Intel uses a modified CORDIC algorithm
- [18,19] to compute the transcendental functions, Cyrix uses
- rational approximations that utilize a very fast array multiplier.
- For an explanation why this approach is superior to CORDIC with
- today's technology, see [61]. Also, Cyrix uses an internal 75 bit
- data path for the mantissa [15], so intermediate computations in
- the generation of transcendental function values will enjoy some
- additional accuracy over the 64 bits provided by the double
- extended format. Using 75 mantissa bits also provides an advantage
- over other coprocessors like the Intel 387DX and ULSI 83C87 which
- use only a 68 bit data path for the mantissa [58,59]. Note that a
- maximum relative error of 0.5 ULP for the Cyrix coprocessor does
- not mean that it returns the 'exact' result (machine number closest
- to infinitely precise result) all the time. Just consider the case
- where the infinitely precise result of a transcendental function
- falls nearly half way between two machine numbers. A relative error
- of 0.5 ULP can cause the result to be either of the numbers after
- rounding, depending on the direction of the error. But the 83D87
- should deliver results that never differ from the 'exact' result
- by more than one ULP. Cyrix also claims that its transcendental
- functions satisfy the monotonicity criterion [13], a claim not
- made by any of the competitors. Monotonicity means that for all
- x1 > x2, it always follows that f(x1) >= f(x2) for an increasing
- function like sin on [0..pi/4]. Likewise, for a decreasing
- function like cos on [0..pi/4], for all x1 > x2, it follows that
- f(x1) <= f(x2).
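-
- The notions of ULP error and monotonicity can be made concrete with
- a short sketch. It is written in Python, works on IEEE double
- precision rather than on the 80 bit double extended format, and
- measures the host's exp() routine, not any coprocessor. The names
- ulp_error and is_monotonic_increasing are illustrative only; they
- are not taken from TRANCK or from any vendor's library.
-

```python
import math
from decimal import Decimal, getcontext
from fractions import Fraction

getcontext().prec = 60  # reference precision, far beyond a 53-bit double


def ulp_error(x):
    """Error of math.exp(x), in ULPs of the returned double.

    The 60-digit Decimal value plays the role of the 'infinitely
    precise' result.
    """
    got = math.exp(x)
    ref = Decimal(x).exp()               # Decimal(x) converts the double exactly
    err = abs(Fraction(got) - Fraction(ref))
    return float(err / Fraction(math.ulp(got)))


def is_monotonic_increasing(f, xs):
    """True if f never decreases as the argument grows over sorted xs."""
    ys = [f(x) for x in sorted(xs)]
    return all(a <= b for a, b in zip(ys, ys[1:]))


if __name__ == "__main__":
    xs = [i / 1000.0 for i in range(1, 501)]      # arguments in (0, 0.5]
    print("max error in ULPs:", max(ulp_error(x) for x in xs))
    print("monotonic:", is_monotonic_increasing(math.exp, xs))
```

- An error below 0.5 ULP in this measure means the returned value is
- the correctly rounded one; an error between 0.5 and 1 ULP means a
- neighbouring machine number was returned.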
-
- The Weitek Abacus 3167 and 4167 implement only the basic arithmetic
- operations (add, subtract, negate, multiply, divide, square root)
- in hardware. Transcendental functions are provided via a software
- library provided by Weitek. For these library functions Weitek
- claims a maximum relative error of 5 ULPs [31,33] (ULP = Unit in
- the Last Place, numeric weight of the least significant mantissa
- bit). This means that the last three bits in the mantissa of a
- double precision result can be wrong. Note that the Intel 387 and
- compatible math coprocessors generate the transcendental functions
- with a small relative error with regard to the _double extended
- precision_ format. Thus, when rounded to double precision, their
- function values are nearly always 'exact'. 387 type coprocessors
- therefore have superior accuracy when compared with Weitek's
- coprocessors.
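-
- The arithmetic behind these two statements can be sketched as
- follows (Python; all names are illustrative). An error of n ULPs
- spans as many low-order mantissa bits as are needed to write n in
- binary, and one ULP of the 64 bit double extended mantissa weighs
- only 2**-11 ULP of the 53 bit double mantissa. The misrounding bound
- is a rough estimate that assumes exact results are spread uniformly
- between neighbouring doubles; it is not a measured value.
-

```python
EXT_MANTISSA_BITS = 64     # double extended format
DBL_MANTISSA_BITS = 53     # double format, including the implicit bit


def low_bits_affected(ulps):
    """Number of low-order mantissa bits an error of `ulps` ULPs spans."""
    return int(ulps).bit_length()


def ext_ulps_in_double_ulps(ext_ulps):
    """Express an error given in double extended ULPs in double ULPs."""
    return ext_ulps / 2 ** (EXT_MANTISSA_BITS - DBL_MANTISSA_BITS)


def misrounding_bound(ext_ulps):
    """Rough upper bound on the fraction of results that round to the
    wrong double (uniform-distribution assumption, not a measurement)."""
    return min(1.0, ext_ulps_in_double_ulps(ext_ulps))


print(low_bits_affected(5))          # 3 bits for the 5 ULP library claim
print(ext_ulps_in_double_ulps(1))    # 0.00048828125, i.e. 2**-11
print(misrounding_bound(3))          # bound for a 3 ULP extended error
```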
-
- The test diskette distributed with early versions of the
- Cyrix 83D87 contained a program TRANCK that checks the
- accuracy of the transcendental functions in the coprocessor
- against a more precise software arithmetic [16]. I used this
- program to compare the accuracy of the transcendental functions
- on those 287/387/486 coprocessors/FPUs available to me. As TRANCK
- will not accept negative numbers as interval limits, I tested
- each function on an interval along the positive x-axis. The
- functions tested are F2XM1 (2**x-1), FSIN (sine), FCOS (cosine),
- FPTAN (tangent), FPATAN (arctangent), FYL2X (y * log2 (x)),
- and FYL2XP1 (y * log2 (x+1)). These are all the transcendental
- functions implemented on the 80387. Note that the square root
- (FSQRT) is *not* a transcendental function. For every function,
- 100,000 arguments were evaluated. The arguments were uniformly
- distributed within the interval tested. The EM87 emulator could
- not be checked with TRANCK, since the multiple precision package
- in TRANCK would always return with an error message immediately.
- However, the Franke387 emulator could be tested, and its results
- are included below.
-
-
- Test results for accuracy of transcendental functions for double
- extended precision as returned by the program TRANCK. 100,000
- trials per function.
-
- %wrong       is the percentage of results that differ from the 'exact'
-              result (infinitely precise result rounded to 64 bits).
- ULP_hi       is the number of results where the returned result was
-              greater than the 'exact' (correctly rounded) result by
-              one ULP (the numeric weight of the last mantissa bit,
-              2**-64 to 2**-63 depending on the size of the number).
- ULPs_hi      is the number of results where the returned result was
-              greater than the 'exact' result by two or more ULPs.
- ULP_lo       is the number of results where the returned result was
-              smaller than the 'exact' (correctly rounded) result by
-              one ULP (the numeric weight of the last mantissa bit,
-              2**-64 to 2**-63 depending on the size of the number).
- ULPs_lo      is the number of results where the returned result was
-              smaller than the 'exact' result by two or more ULPs.
- max ULP err  is the maximum deviation of a returned result from the
-              'exact' answer expressed in ULPs.
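-
- How the columns relate to each other can be sketched in a few lines
- of Python. The function names are illustrative and this is not how
- TRANCK itself is implemented; the sketch assumes that the error of
- every trial is already available as a signed whole number of ULPs.
- The relation %wrong = (sum of the four counts) / trials appears to
- hold for the rows below, e.g. 864 + 622 = 1486 out of 100,000 for
- the Cyrix 387+ SIN row.
-

```python
def tally(ulp_diffs):
    """Summarise signed errors, given in whole ULPs, as one table row.

    ulp_diffs[i] = (returned result - 'exact' result) / ULP.
    """
    n = len(ulp_diffs)
    row = {
        "ULP_hi": sum(1 for d in ulp_diffs if d == 1),
        "ULPs_hi": sum(1 for d in ulp_diffs if d >= 2),
        "ULP_lo": sum(1 for d in ulp_diffs if d == -1),
        "ULPs_lo": sum(1 for d in ulp_diffs if d <= -2),
        "max ULP err": max((abs(d) for d in ulp_diffs), default=0),
    }
    wrong = row["ULP_hi"] + row["ULPs_hi"] + row["ULP_lo"] + row["ULPs_lo"]
    row["%wrong"] = 100.0 * wrong / n if n else 0.0
    return row


def percent_wrong(ulp_hi, ulps_hi, ulp_lo, ulps_lo, trials=100_000):
    """Recompute %wrong from the four count columns of a table row."""
    return 100.0 * (ulp_hi + ulps_hi + ulp_lo + ulps_lo) / trials


print(tally([0, 0, 1, -1, -1, 3, 0, 0, 0, 0]))
print(percent_wrong(864, 0, 622, 0))     # Cyrix 387+ SIN row
```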
-
-
- Franke387 V2.4 emulator
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4        39.042   25301      708   13029        4        2
- COS     0,pi/4        75.714   49827    25887       0        0        3
- TAN     0,pi/4        76.976   14230    10029   24323    28394        9
- ATAN    0,1           55.826   26028     1529   24044     4225        4
- 2XM1    0,0.5         96.717       0        0   47910    48807        5
- YL2XP1  0,sqrt(2)-1   93.007     578        9   27416    65004        8
- YL2X    0.1,10        62.252   16817     4712   37082     3641     2953
-
-
- INTEL 80287
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4           N/A     N/A      N/A     N/A      N/A      N/A
- COS     0,pi/4           N/A     N/A      N/A     N/A      N/A      N/A
- TAN     0,pi/4        37.001   18756      524   17405      316        2
- ATAN    0,1            9.666    6065        0    3601        0        1
- 2XM1    0,0.5         19.920       0        0   19920        0        1
- YL2XP1  0,sqrt(2)-1    7.780     868        0    6912        0        1
- YL2X    0.1,10         1.287     723        0     564        0        1
-
-
- INTEL 387
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4        28.872    2467        0   26392       13        2
- COS     0,pi/4        27.213   27169       35       9        0        2
- TAN     0,pi/4        10.532     441        0   10091        0        1
- ATAN    0,1            7.088    2386        0    4691        1        2
- 2XM1    0,0.5         32.024       0        0   32024        0        1
- YL2XP1  0,sqrt(2)-1   22.611    3461        0   19150        0        1
- YL2X    0.1,10        13.020    6508        0    6512        0        1
-
-
- INTEL 387DX
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4        28.873    2467        0   26393       13        2
- COS     0,pi/4        27.121   27090       22       9        0        2
- TAN     0,pi/4        10.711     457        0   10254        0        1
- ATAN    0,1            7.088    2386        0    4691        1        2
- 2XM1    0,0.5         32.024       0        0   32024        0        1
- YL2XP1  0,sqrt(2)-1   22.611    3461        0   19150        0        1
- YL2X    0.1,10        13.020    6508        0    6512        0        1
-
-
- ULSI 83C87
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4        35.530    4989        6   30238      297        2
- COS     0,pi/4        43.989   11193      675   31393      728        2
- TAN     0,pi/4        48.539   18880     1015   26349     2295        3
- ATAN    0,1           20.858      62        0   20796        0        1
- 2XM1    0,0.5         21.257       4        0   21253        0        1
- YL2XP1  0,sqrt(2)-1   27.893    9446        0   18213      234        2
- YL2X    0.1,10        13.603    9816        0    3787        0        1
-
-
- IIT 3C87
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4        18.650   11171        0    7479        0        1
- COS     0,pi/4         7.700    3024        0    4676        0        1
- TAN     0,pi/4        20.973    9681        0   11291        1        2
- ATAN    0,1           19.280   13186        0    6094        0        1
- 2XM1    0,0.5         25.660   17570        0    8090        0        1
- YL2XP1  0,sqrt(2)-1   45.830   23503     1896   19654      777        3
- YL2X    0.1,10        10.888    5638      357    4845       48        3
-
-
- CYRIX 83D87
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4         1.554    1015        0     539        0        1
- COS     0,pi/4         0.925     143        0     782        0        1
- TAN     0,pi/4         4.147     881        0    3266        0        1
- ATAN    0,1            0.656     229        0     427        0        1
- 2XM1    0,0.5          2.628    1433        0    1194        0        1
- YL2XP1  0,sqrt(2)-1    3.242     825        0    2417        0        1
- YL2X    0.1,10         0.931     256        0     675        0        1
-
-
- CYRIX 387+
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4         1.486     864        0     622        0        1
- COS     0,pi/4         2.072      12        0    2060        0        1
- TAN     0,pi/4         0.602      63        0     539        0        1
- ATAN    0,1            0.384      12        0     372        0        1
- 2XM1    0,0.5          1.985      27        0    1958        0        1
- YL2XP1  0,sqrt(2)-1    3.662    1705        0    1957        0        1
- YL2X    0.1,10         0.764     367        0     397        0        1
-
-
- INTEL RapidCAD, Intel 486
-                                                                   max
- funct.  interval      %wrong  ULP_hi  ULPs_hi  ULP_lo  ULPs_lo  ULP err
-
- SIN     0,pi/4        16.991    1517        0   15474        0        1
- COS     0,pi/4         9.003    7603        0    1400        0        1
- TAN     0,pi/4        10.532     441        0   10091        0        1
- ATAN    0,1            7.078    2386        0    4691        1        2
- 2XM1    0,0.5         32.025       0        0   32025        0        1
- YL2XP1  0,sqrt(2)-1   21.800     533        0   21267        0        1
- YL2X    0.1,10         3.894    1879        0    2015        0        1
-
-
- The test results above indicate that none of the 80x87 compatibles
- exceeds Intel's stated error bound of 3 ULPs for the transcendental
- functions. However, some coprocessors are more accurate than others.
- Rating the coprocessors according to the accuracy of their
- transcendental functions gives the following list (highest accuracy
- first): Cyrix 387+, Cyrix 83D87, Intel 486, Intel RapidCAD, Intel
- 80287(!), Intel 387DX, Intel 80387, IIT 3C87, ULSI 83C87. The tests
- also show that the problems with excessive inaccuracy of the
- transcendental functions in early versions of the IIT coprocessors,
- with errors of up to 8 ULPs [8], have been eliminated. According to
- [56], certain problems with the FPATAN instruction on the IIT 3C87
- occurring under the UNIX version of AutoCAD were corrected in June
- 1990. The Franke387 has acceptable accuracy for the FSIN, FCOS, and
- FPATAN instructions, taking into consideration that, according to
- its documentation, Franke387 uses only 64 bits of precision for
- intermediate results, while coprocessors typically use 68 bits or
- more. However, the larger errors in the FPTAN, F2XM1, FYL2XP1, and
- especially the FYL2X operations show that the emulator doesn't use
- state-of-the-art algorithms, which ensure an error of only a very
- few ULPs even if no extra-precise intermediate results are
- available.
-
- References
-
- [1] Schnurer, G.: Zahlenknacker im Vormarsch.
- c't 1992, No. 4, pp. 170-186
- [2] Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark.
- Computer Journal, Vol. 19, No. 1, 1976, pp. 43-49
- [3] Wichmann, B.A.: Validation code for the Whetstone benchmark.
- NPL Report DITC 107/88, National Physical Laboratory, UK,
- March 1988
- [4] Curnow, H.J.: Whither Whetstone? The Synthetic Benchmark after
- 15 Years.
- In: Aad van der Steen (ed.): Evaluating Supercomputers.
- London: Chapman and Hall 1990
- [5] Dongarra, J.J.: The Linpack Benchmark: An Explanation.
- In: Aad van der Steen (ed.): Evaluating Supercomputers.
- London: Chapman and Hall 1990
- [6] Dongarra, J.J.: Performance of Various Computers Using Standard
- Linear Equations Software.
- Report CS-89-85, Computer Science Department, University of
- Tennessee, March 11, 1992
- [7] Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test.
- Design & Elektronik 1990, No. 13, pp. 105-110
- [8] Ungerer, B.: Sockelfolger.
- c't 1990, No. 4, pp. 162-163
- [9] Coonen, J.T.: Contributions to a Proposed Standard for Binary
- Floating-Point Arithmetic
- Ph.D. thesis, University of California, Berkeley, 1984
- [10] IEEE: IEEE Standard for Binary Floating-Point Arithmetic.
- SIGPLAN Notices, Vol. 22, No. 2, 1985, pp. 9-25
- [11] IEEE Standard for Binary Floating-Point Arithmetic.
- ANSI/IEEE Std 754-1985.
- New York, NY: Institute of Electrical and Electronics
- Engineers 1985
- [12] FasMath 83D87 Compatibility Report. Cyrix Corporation, Nov. 1989
- Order No. B2004
- [13] FasMath 83D87 Accuracy Report. Cyrix Corporation, July 1990
- Order No. B2002
- [14] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990
- Order No. B2004
- [15] FasMath 83D87 User's Manual. Cyrix Corporation, June 1990
- Order No. L2001-003
- [16] Brent, R.P.: A FORTRAN multiple-precision arithmetic package.
- ACM Transactions on Mathematical Software, Vol. 4, No. 1,
- March 1978, pp. 57-70
- [17] 387DX User's Manual, Programmer's Reference. Intel Corporation,
- 1989
- Order No. 231917-002
- [18] Volder, J.E.: The CORDIC Trigonometric Computing Technique.
- IRE Transactions on Electronic Computers, Vol. EC-8, No. 5,
- September 1959, pp. 330-334
- [19] Walther, J.S.: A unified algorithm for elementary functions.
- AFIPS Conference Proceedings, Vol. 38, SJCC 1971, pp. 379-385
- [20] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der
- IBM 3090E mit Vektoreinrichtung.
- Arbeitsbericht RRZK-8803, Regionales Rechenzentrum an der
- Universität zu Köln, February 1988
- [21] McMahon, F.H.: The Livermore Fortran Kernels: A test of the
- numerical performance range.
- Technical Report UCRL-53745, Lawrence Livermore National
- Laboratory, USA, December 1986
- [22] Nave, R.: Implementation of Transcendental Functions on a Numerics
- Processor.
- Microprocessing and Microprogramming, Vol. 11, No. 3-4,
- March-April 1983, pp. 221-225
- [23] Yuen, A.K.: Intel's Floating-Point Processors.
- Electro/88 Conference Record, Boston, MA, USA, 10-12 May 1988,
- pp. 48/5-1 - 48/5-7
- [24] Stiller, A.; Ungerer, B.: Ausgerechnet.
- c't 1990, No. 1, pp. 90-92
- [25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase?
- PC Professionell, June 1991, pp. 214-237
- [26] Intel 80286 Hardware Reference Manual. Intel Corporation, 1987
- Order No. 210760-002
- [27] AMD 80C287 80-bit CMOS Numeric Processor. Advanced Micro Devices,
- June 1989
- Order No. 11671B/0
- [28] Intel RapidCAD(tm) Engineering CoProcessor Performance Brief.
- Intel Corporation, 1992
- [29] i486(tm) Microprocessor Performance Report. Intel Corporation,
- April 1990
- Order No. 240734-001
- [30] Intel486(tm) DX2 Microprocessor Performance Brief. Intel
- Corporation, March 1992
- Order No. 241254-001
- [31] Abacus 3167 Floating-Point Coprocessor Data Book. Weitek
- Corporation, July 1990
- DOC No. 9030
- [32] WTL 4167 Floating-Point Coprocessor Data Book. Weitek
- Corporation, July 1989
- DOC No. 8943
- [33] Abacus Software Designer's Guide. Weitek Corporation,
- September 1989
- DOC No. 8967
- [34] Stiller, A.: Cache & Carry.
- c't 1992, No. 6, pp. 118-130
- [35] Stiller, A.: Cache & Carry, Teil 2.
- c't 1992, No. 7, pp. 28-34
- [36] Palmer, J.F.; Morse, S.P.: Die mathematischen Grundlagen der
- Numerik-Prozessoren 8087/80287.
- München: tewi 1985
- [37] 80C187 80-bit Math Coprocessor Data Sheet. Intel Corporation,
- September 1989
- Order No. 270640-003
- [38] IIT-2C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
- [39] Engineering note 4x4 matrix multiply transformation. IIT, 1989
- [40] Tscheuschner, E.: 4 mal 4 auf einen Streich.
- c't 1990, No. 3, pp. 266-276
- [41] Goldberg, D.: Computer Arithmetic.
- In: Hennessy, J.L.; Patterson, D.A.: Computer Architecture A
- Quantitative Approach. San Mateo, CA: Morgan Kaufmann 1990
- [42] 8087 Math Coprocessor Data Sheet. Intel Corporation, October 1989,
- Order No. 205835-007
- [43] 8086/8088 User's Manual, Programmer's and Hardware Reference.
- Intel Corporation, 1989
- Order No. 240487-001
- [44] 80286 and 80287 Programmer's Reference Manual. Intel Corporation,
- 1987
- Order No. 210498-005
- [45] 80287XL/XLT CHMOS III Math Coprocessor Data Sheet. Intel
- Corporation, May 1990
- Order No. 290376-001
- [46] Cyrix FasMath(tm) 82S87 Coprocessor Data Sheet. Cyrix Corporation,
- 1991
- Document 94018-00 Rev. 1.0
- [47] IIT-3C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
- [48] 486(tm)SX(tm) Microprocessor/ 487(tm)SX(tm) Math CoProcessor
- Data Sheet. Intel Corporation, April 1991.
- Order No. 240950-001
- [49] Schnurer, G.: Die große Verlade.
- c't 1991, No. 7, pp. 55-57
- [50] Schnurer, G.: Eine 4 für alle.
- c't 1991, No. 6, p. 25
- [51] Intel486(tm)DX Microprocessor Data Book. Intel Corporation,
- June 1991
- Order No. 240440-004
- [52] i486(tm) Microprocessor Hardware Reference Manual. Intel
- Corporation, 1990
- Order No. 240552-001
- [53] i486(tm) Microprocessor Programmer's Reference Manual. Intel
- Corporation, 1990
- Order No. 240486-001
- [54] Ungerer, B.: Kalte Hüte.
- c't 1992, No. 8, pp. 140-144
- [55] Ungerer, B.: Heiße Sache.
- c't 1991, No. 4, pp. 104-108
- [56] Rosch, W.L.: Handfeste Hilfe oder Seifenblase?
- PC Professionell, June 1991, pp. 214-237
- [57] Niederkrüger, W.: Lebendige Vergangenheit.
- c't 1990, No. 12, pp. 114-116
- [58] ULSI Math*Co Advanced Math Coprocessor Technical Specification.
- ULSI Systems, 5/92, Rev. E
- [59] 387(tm)DX Math CoProcessor Data Sheet. Intel Corporation,
- September 1990.
- Order No. 240448-003
- [60] 387(tm) Numerics Coprocessor Extension Data Sheet. Intel
- Corporation, February 1989.
- Order No. 231920-005
- [61] Koren, I.; Zinaty, O.: Evaluating Elementary Functions in a
- Numerical Coprocessor Based on Rational Approximations.
- IEEE Transactions on Computers, Vol. C-39, No. 8, August 1990,
- pp. 1030-1037
- [62] 387(tm) SX Math CoProcessor Data Sheet. Intel Corporation,
- November 1989
- Order No. 240225-005
- [63] Frenkel, G.: Coprocessors Speed Numeric Operations.
- PC-Week, August 27, 1990
- [64] Schnurer, G.; Stiller, A.: Auto-Matt.
- c't 1991, No. 10, pp. 94-96
- [65] Grehan, R.: FPU Face-Off.
- Byte, November 1990, pp. 194-200
- [66] Tang, P.T.P.: Testing Computer Arithmetic by Elementary Number
- Theory. Preprint MCS-P84-0889, Mathematics and Computer Science
- Division, Argonne National Laboratory, August 1989
- [67] Ferguson, W.E.: Selecting math coprocessors.
- IEEE Spectrum, July 1991, pp. 38-41
- [68] Schnabel, J.: Viermal 387.
- Computer Persönlich 1991, No. 22, pp. 153-156
- [69] Hofmann, J.: Starke Rechenknechte.
- mc 1990, No. 7, pp. 64-67
- [70] Woerrlein, H.; Hinnenberg, R.: Die Lust an der Power.
- Computer Live 1991, No. 10, pp. 138-149
-
-
-
- Manufacturers' addresses
-
- Intel Corporation
- 3065 Bowers Avenue
- Santa Clara, CA 95051
- USA
-
- IIT Integrated Information Technology, Inc.
- 2540 Mission College Blvd.
- Santa Clara, CA 95054
- USA
-
- ULSI Systems, Inc.
- 58 Daggett Drive
- San Jose, CA 95134
- USA
-
- Chips & Technologies, Inc.
- 3050 Zanker Road
- San Jose, CA 95134
- USA
-
- Weitek Corporation
- 1060 East Arques Avenue
- Sunnyvale, CA 94086
- USA
-
- AMD Advanced Micro Devices, Inc.
- 901 Thompson Place
- P.O.B. 3453
- Sunnyvale, CA 94088-3453
- USA
-
- Cyrix Corporation
- P.O.B. 850118
- Richardson, TX 75085
- USA
-
-
- Appendix A
-
-
- {$N+,E+}
- PROGRAM PCtrl;
-
- VAR B, c:         EXTENDED;
-     Precision, L: WORD;
-
- PROCEDURE SetPrecisionControl (Precision: WORD);
- (* This procedure sets the internal precision of the NDP. Available *)
- (* precision values: 0 - 24 bits (SINGLE)                           *)
- (*                   1 - n.a.    (mapped to single)                 *)
- (*                   2 - 53 bits (DOUBLE)                           *)
- (*                   3 - 64 bits (EXTENDED)                         *)
-
- VAR CtrlWord: WORD;
-
- BEGIN {SetPrecisionControl}
-    IF Precision = 1 THEN
-       Precision := 0;
-    Precision := Precision SHL 8; { make mask for PC field in ctrl word }
-    ASM
-       FSTCW   [CtrlWord]         { store NDP control word              }
-       MOV     AX, [CtrlWord]     { load control word into CPU          }
-       AND     AX, 0FCFFh         { mask out precision control field    }
-       OR      AX, [Precision]    { set desired precision in PC field   }
-       MOV     [CtrlWord], AX     { store new control word              }
-       FLDCW   [CtrlWord]         { set new precision control in NDP    }
-    END;
- END; {SetPrecisionControl}
-
- BEGIN {main}
-    FOR Precision := 1 TO 3 DO BEGIN
-       B := 1.2345678901234567890;
-       SetPrecisionControl (Precision);
-       FOR L := 1 TO 20 DO BEGIN
-          B := Sqrt (B);
-       END;
-       FOR L := 1 TO 20 DO BEGIN
-          B := B*B;
-       END;
-       SetPrecisionControl (3);   { full precision for printout }
-       WriteLn (Precision, B:28);
-    END;
- END.
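-
- What PCtrl measures can be sketched in Python. Taking the square
- root 20 times moves the value very close to 1.0; squaring 20 times
- afterwards amplifies any rounding error made there by roughly 2**20,
- so the mantissa width in use shows up in the printed digits. Python
- floats are IEEE doubles, so this sketch corresponds at best to the
- DOUBLE line of the test output; it cannot select 24 bit or 64 bit
- precision. The function name is illustrative.
-

```python
import math


def sqrt_square_round_trip(x, n=20):
    """Take the square root n times, then square n times (as in PCtrl)."""
    b = x
    for _ in range(n):
        b = math.sqrt(b)
    for _ in range(n):
        b = b * b
    return b


x = 1.2345678901234567890
b = sqrt_square_round_trip(x)
print(f"{b:.17e}")
print("relative deviation:", abs(b - x) / x)
```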
-
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$N+,E+}
- PROGRAM RCtrl;
-
- VAR B, c:            EXTENDED;
-     RoundingMode, L: WORD;
-
-
- PROCEDURE SetRoundingMode (RCMode: WORD);
- (* This procedure selects one of four available rounding modes *)
- (* 0 - Round to nearest (default)                              *)
- (* 1 - Round down (towards negative infinity)                  *)
- (* 2 - Round up (towards positive infinity)                    *)
- (* 3 - Chop (truncate, round towards zero)                     *)
-
- VAR CtrlWord: WORD;
-
- BEGIN
-    RCMode := RCMode SHL 10;      { make mask for RC field in control word }
-    ASM
-       FSTCW   [CtrlWord]         { store NDP control word                 }
-       MOV     AX, [CtrlWord]     { load control word into CPU             }
-       AND     AX, 0F3FFh         { mask out rounding control field        }
-       OR      AX, [RCMode]       { set desired rounding mode in RC field  }
-       MOV     [CtrlWord], AX     { store new control word                 }
-       FLDCW   [CtrlWord]         { set new rounding control in NDP        }
-    END;
- END;
-
- BEGIN
-    FOR RoundingMode := 0 TO 3 DO BEGIN
-       B := 1.2345678901234567890e100;
-       SetRoundingMode (RoundingMode);
-       FOR L := 1 TO 51 DO BEGIN
-          B := Sqrt (B);
-       END;
-       FOR L := 1 TO 51 DO BEGIN
-          B := -B*B;
-       END;
-       SetRoundingMode (0);       { round to nearest for printout }
-       WriteLn (RoundingMode, B:28);
-    END;
- END.
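-
- The bit manipulation performed by SetPrecisionControl and
- SetRoundingMode can be sketched in Python. The masks 0FCFFh and
- 0F3FFh used above clear bits 8-9 (PC field) and bits 10-11 (RC
- field) of the control word. The helper names are illustrative, and
- the start value 037Fh is assumed to be the control word that FINIT
- loads on a 387; nothing here touches a real coprocessor.
-

```python
PC_SHIFT, PC_MASK = 8, 0x0300      # precision control, bits 8-9
RC_SHIFT, RC_MASK = 10, 0x0C00     # rounding control, bits 10-11


def set_field(ctrl_word, value, shift, mask):
    """Return ctrl_word with one 2-bit field replaced by value (0..3)."""
    return (ctrl_word & ~mask & 0xFFFF) | ((value << shift) & mask)


def set_precision(ctrl_word, pc):
    """pc: 0 = 24 bits, 2 = 53 bits, 3 = 64 bits; 1 is mapped to 0."""
    if pc == 1:
        pc = 0
    return set_field(ctrl_word, pc, PC_SHIFT, PC_MASK)


def set_rounding(ctrl_word, rc):
    """rc: 0 = nearest, 1 = down, 2 = up, 3 = chop."""
    return set_field(ctrl_word, rc, RC_SHIFT, RC_MASK)


cw = 0x037F                              # assumed state after FINIT
print(f"{set_precision(cw, 0):04X}")     # 007F: single precision
print(f"{set_precision(cw, 2):04X}")     # 027F: double precision
print(f"{set_rounding(cw, 3):04X}")      # 0F7F: chop
```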
-
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$N+,E+}
-
- PROGRAM DenormTs;
-
- VAR E: EXTENDED;
-     D: DOUBLE;
-     S: SINGLE;
-
- BEGIN
-    WriteLn ('Testing support and printing of denormals');
-    WriteLn;
-    Write ('Coprocessor is: ');
-    CASE Test8087 OF
-       0: WriteLn ('Emulator');
-       1: WriteLn ('8087 or compatible');
-       2: WriteLn ('80287 or compatible');
-       3: WriteLn ('80387 or compatible');
-    END;
-    WriteLn;
-    S := 1.18e-38;
-    S := S * 3.90625e-3;
-    IF S = 0 THEN
-       WriteLn ('SINGLE denormals not supported')
-    ELSE BEGIN
-       WriteLn ('SINGLE denormals supported');
-       WriteLn ('SINGLE denormal prints as: ', S);
-       WriteLn ('Denormal should be printed as 4.60943...E-0041');
-    END;
-    WriteLn;
-    D := 2.24e-308;
-    D := D * 3.90625e-3;
-    IF D = 0 THEN
-       WriteLn ('DOUBLE denormals not supported')
-    ELSE BEGIN
-       WriteLn ('DOUBLE denormals supported');
-       WriteLn ('DOUBLE denormal prints as: ', D);
-       WriteLn ('Denormal should be printed as 8.75...E-0311');
-    END;
-    WriteLn;
-    E := 3.37e-4932;
-    E := E * 3.90625e-3;
-    IF E = 0 THEN
-       WriteLn ('EXTENDED denormals not supported')
-    ELSE BEGIN
-       WriteLn ('EXTENDED denormals supported');
-       WriteLn ('EXTENDED denormal prints as: ', E);
-       WriteLn ('Denormal should be printed as 1.3164...E-4934');
-    END;
- END.
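-
- The same check can be sketched in Python for the SINGLE and DOUBLE
- formats. DenormTs scales a number just above the smallest normal
- value by 2**-8 (3.90625e-3); the product is only representable as a
- denormal, so a result of zero means denormals are flushed. Python
- has no 80 bit type in its standard library, so the EXTENDED case is
- omitted. The helper to_single is illustrative; it rounds to single
- precision by packing the value into 4 bytes.
-

```python
import struct
import sys


def to_single(x):
    """Round a Python float (IEEE double) to IEEE single and back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


SCALE = 3.90625e-3                              # 2**-8, as in DenormTs

d = 2.24e-308 * SCALE                           # below smallest normal double
s = to_single(to_single(1.18e-38) * SCALE)      # below smallest normal single

print("DOUBLE denormal:", d, "supported" if d != 0.0 else "flushed to zero")
print("SINGLE denormal:", s, "supported" if s != 0.0 else "flushed to zero")
```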
-
-
- Appendix B
-
-
- ; FILE: APFELM4.ASM
- ; assemble with MASM /e APFELM4 or TASM /e APFELM4
-
-
- CODE SEGMENT BYTE PUBLIC 'CODE'
- ASSUME CS: CODE
-
- PAGE ,120
-
- PUBLIC APPLE87;
-
- APPLE87 PROC NEAR
- PUSH BP ; save caller's base pointer
- MOV BP, SP ; make new frame pointer
- PUSH DS ; save caller's data segment
- PUSH SI ; save register
- PUSH DI ; variables
- LDS BX, [BP+04] ; pointer to parameter record
- FINIT ; init 80x87 FSP->R0
- FILD WORD PTR [BX+02] ; maxrad FSP->R7
- FLD QWORD PTR [BX+08] ; qmax FSP->R6
- FSUB QWORD PTR [BX+16] ; qmax-qmin FSP->R6
- DEC WORD PTR [BX+04] ; ymax-1
- FIDIV WORD PTR [BX+04] ; (qmax-qmin)/(ymax-1)FSP->R6
- FSTP QWORD PTR [BX+16] ; save delta_q FSP->R7
- FLD QWORD PTR [BX+24] ; pmax FSP->R6
- FSUB QWORD PTR [BX+32] ; pmax-pmin FSP->R6
- DEC WORD PTR [BX+06] ; xmax-1
- FIDIV WORD PTR [BX+06] ; delta_p FSP->R6
- MOV AX, [BX] ; save maxiter,[BX] needed for
- MOV [BX+2], AX ; 80x87 status now
- XOR BP, BP ; y=0
- FLD QWORD PTR [BX+08] ; qmax FSP->R5
- CMP WORD PTR [BX+40], 0 ; fast mode on 8087 desired ?
- JE yloop ; no, normal mode
- FSTCW [BX] ; save NDP control word
- AND WORD PTR [BX], 0FCFFh; set PCTRL = single precision
- FLDCW [BX] ; get back NDP control word
- yloop: XOR DI, DI ; x=0
- FLD QWORD PTR [BX+32] ; pmin FSP->R4
- xloop: FLDZ ; j**2= 0 FSP->R3
- FLDZ ; 2ij = 0 FSP->R2
- FLDZ ; i**2= 0 FSP->R1
- MOV CX, [BX+2] ; maxiter
- MOV DL, 41h ; mask for C0 and C3 cond.bits
- iteration: FSUB ST, ST(2) ; i**2-j**2 FSP->R1
- FADD ST, ST(3) ; i**2-j**2+p = i FSP->R1
- FLD ST(0) ; duplicate i FSP->R0
- FMUL ST(1), ST ; i**2 FSP->R0
- FADD ST, ST(0) ; 2i FSP->R0
- FXCH ST(2) ; 2*i*j FSP->R0
- FADD ST, ST(5) ; 2*i*j+q = j FSP->R0
- FMUL ST(2), ST ; 2*i*j FSP->R0
- FMUL ST, ST(0) ; j**2 FSP->R0
- FST ST(3) ; save j**2 FSP->R0
- FADD ST, ST(1) ; i**2+j**2 FSP->R0
- FCOMP ST(7) ; i**2+j**2 > maxrad? FSP->R1
- FSTSW [BX] ; save 80x87 cond.codeFSP->R1
- TEST BYTE PTR [BX+1], DL ; test carry and zero flags
- LOOPNZ iteration ; until maxiter if not diverg.
- MOV DX, CX ; number of loops executed
- NEG CX ; carry set if CX <> 0
- ADC DX, 0 ; adjust DX if no. of loops<>0
-
- ; plot point here (DI = X, BP = y, DX has the color)
-
- FSTP ST(0) ; pop i**2 FSP->R2
- FSTP ST(0) ; pop 2ij FSP->R3
- FSTP ST(0) ; pop j**2 FSP->R4
- FADD ST,ST(2) ; p=p+delta_p FSP->R4
- INC DI ; x:=x+1
- CMP DI, [BX+6] ; x > xmax ?
- JBE xloop ; no, continue on same line
- FSTP ST(0) ; pop p FSP->R5
- FSUB QWORD PTR [BX+16] ; q=q-delta_q FSP->R5
- INC BP ; y:=y+1
- CMP BP, [BX+4] ; y > ymax ?
- JBE yloop ; no, picture not done yet
-
- groesser: POP DI ; restore
- POP SI ; register variables
- POP DS ; restore caller's data segm.
- POP BP ; restore caller's base pointer
- RET 4 ; pop parameters and return
- APPLE87 ENDP
-
- CODE ENDS
-
- END
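-
- For readers who do not want to trace the coprocessor stack by hand,
- the loop structure of APPLE87 can be sketched in Python. The sketch
- follows the register comments above (i**2, 2ij, j**2, p, q) but uses
- double precision, so iteration counts for points near the boundary
- of the set may differ from those of an 80x87 running in extended
- precision. The FLOP counting rule (nine operations per pass through
- the inner loop, one per point, one per line, four for setup; loads,
- exchanges and stores not counted) is an assumption; it is chosen
- because it yields the 104002 FLOP stated in the PeakFlop driver for
- 11452 passes. The names apple and flop_count are illustrative.
-

```python
def apple(maxiter=50, maxrad=30, ymax=30, xmax=30,
          pmin=-2.1, pmax=1.1, qmin=-1.2, qmax=1.2):
    """Iterate z -> z*z + c over an xmax-by-ymax grid.

    Returns the total number of passes through the inner loop.
    """
    delta_q = (qmax - qmin) / (ymax - 1)
    delta_p = (pmax - pmin) / (xmax - 1)
    total = 0
    q = qmax
    for _y in range(ymax):
        p = pmin
        for _x in range(xmax):
            i2 = two_ij = j2 = 0.0
            for _ in range(maxiter):
                i = i2 - j2 + p            # new real part
                j = two_ij + q             # new imaginary part
                i2, two_ij, j2 = i * i, (i + i) * j, j * j
                total += 1
                if i2 + j2 > maxrad:       # point has diverged
                    break
            p += delta_p
        q -= delta_q
    return total


def flop_count(passes, ymax=30, xmax=30):
    """FLOP total under the counting rule assumed in the lead-in."""
    return 9 * passes + xmax * ymax + ymax + 4


print(flop_count(11452))                   # 104002
print(apple(), "passes in double precision")
```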
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- UNIT Time;
-
- INTERFACE
-
- FUNCTION Clock: LONGINT; { same as VMS; time in milliseconds }
-
-
- IMPLEMENTATION
-
- FUNCTION Clock: LONGINT; ASSEMBLER;
- ASM
- PUSH DS { save caller's data segment }
- XOR DX, DX { initialize data segment to }
- MOV DS, DX { access ticker counter }
- MOV BX, 46Ch { offset of ticker counter in segm.}
- MOV DX, 43h { timer chip control port }
- MOV AL, 4 { freeze timer 0 }
- PUSHF { save caller's int flag setting }
- STI { allow update of ticker counter }
- LES DI, DS:[BX] { read BIOS ticker counter }
- OUT DX, AL { latch timer 0 }
- LDS SI, DS:[BX] { read BIOS ticker counter }
- IN AL, 40h { read latched timer 0 lo-byte }
- MOV AH, AL { save lo-byte }
- IN AL, 40h { read latched timer 0 hi-byte }
- POPF { restore caller's int flag }
- XCHG AL, AH { correct order of hi and lo }
- MOV CX, ES { ticker counter 1 in CX:DI:AX }
- CMP DI, SI { ticker counter updated ? }
- JE @no_update { no }
- OR AX, AX { update before timer freeze ? }
- JNS @no_update { no }
- MOV DI, SI { use second }
- MOV CX, DS { ticker counter }
- @no_update:NOT AX { counter counts down }
- MOV BX, 36EDh { load multiplier }
- MUL BX { W1 * M }
- MOV SI, DX { save W1 * M (hi) }
- MOV AX, BX { get M }
- MUL DI { W2 * M }
- XCHG BX, AX { AX = M, BX = W2 * M (lo) }
- MOV DI, DX { DI = W2 * M (hi) }
- ADD BX, SI { accumulate }
- ADC DI, 0 { result }
- XOR SI, SI { load zero }
- MUL CX { W3 * M }
- ADD AX, DI { accumulate }
- ADC DX, SI { result in DX:AX:BX }
- MOV DH, DL { move result }
- MOV DL, AH { from DL:AX:BX }
- MOV AH, AL { to }
- MOV AL, BH { DX:AX:BH }
- MOV DI, DX { save result }
- MOV CX, AX { in DI:CX }
- MOV AX, 25110 { calculate correction }
- MUL DX { factor }
- SUB CX, DX { subtract correction }
- SBB DI, SI { factor }
- XCHG AX, CX { result back }
- MOV DX, DI { to DX:AX }
- POP DS { restore caller's data segment }
- END;
-
-
- BEGIN
- Port [$43] := $34; { need rate generator, not square wave}
- Port [$40] := 0; { generator as prog. by some BIOSes }
- Port [$40] := 0; { for timer 0 }
- END. { Time }
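-
- The integer arithmetic in Clock can be sketched in Python. The BIOS
- ticker and the inverted latched value of timer 0 form a 48 bit
- count of timer input cycles; Clock multiplies it by 36EDh, drops
- the low 24 bits, and subtracts a small correction derived from the
- high word of the result. The sketch assumes the nominal PC timer
- input frequency of 1193182 Hz and leaves out the check for a ticker
- update between the two reads. The names clock_ms and exact_ms are
- illustrative.
-

```python
PIT_HZ = 1193182            # nominal timer chip input frequency (assumed)
M = 0x36ED                  # multiplier used by Clock (14061)
C = 25110                   # correction constant used by Clock


def pit_count(bios_ticks, latched):
    """48-bit cycle count: ticker in the high part, inverted latch below."""
    return (bios_ticks << 16) | (~latched & 0xFFFF)


def clock_ms(bios_ticks, latched):
    """Milliseconds, computed with the same integer steps as Clock."""
    ms = ((pit_count(bios_ticks, latched) * M) >> 24) & 0xFFFFFFFF
    ms -= ((ms >> 16) * C) >> 16       # 14061 / 2**24 is slightly too large
    return ms & 0xFFFFFFFF


def exact_ms(bios_ticks, latched):
    """Direct floating-point conversion, for comparison."""
    return pit_count(bios_ticks, latched) * 1000.0 / PIT_HZ


for ticks in (18, 1092, 65543, 1573040):
    print(ticks, clock_ms(ticks, 0), round(exact_ms(ticks, 0), 1))
```

- The multiply-and-shift form avoids a 48 bit division, which would
- be expensive on a 16 bit CPU.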
-
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$A+,B-,R-,I-,V-,N+,E+}
- PROGRAM PeakFlop;
-
- USES Time;
-
- TYPE ParamRec = RECORD
-                    MaxIter, MaxRad, YMax, XMax: WORD;
-                    Qmax, Qmin, Pmax, Pmin:      DOUBLE;
-                    FastMod:                     WORD;
-                    PlotFkt:                     POINTER;
-                    FLOPS:                       LONGINT;
-                 END;
-
- VAR Param: ParamRec;
-     Start: LONGINT;
-
-
- {$L APFELM4.OBJ}
-
- PROCEDURE Apple87 (VAR Param: ParamRec); EXTERNAL;
-
-
- BEGIN
-    WITH Param DO BEGIN
-       MaxIter:= 50;
-       MaxRad := 30;
-       YMax   := 30;
-       XMax   := 30;
-       Pmin   :=-2.1;
-       Pmax   := 1.1;
-       Qmin   :=-1.2;
-       Qmax   := 1.2;
-       FastMod:= Word (FALSE);
-       PlotFkt:= NIL;
-       Flops  := 0;
-    END;
-    Start := Clock;
-    Apple87 (Param);                 { executes 104002 FLOP         }
-    Start := Clock - Start;          { elapsed time in milliseconds }
-    WriteLn ('Peak-MFLOPS: ', 104.002 / Start);
- END.
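-
- The conversion in the last WriteLn can be checked with a small
- sketch: 104002 FLOP in t milliseconds is 104002 / (t * 1000)
- million FLOP per second, i.e. 104.002 / t. The Python function name
- is illustrative; returning None for a zero interval is a design
- choice of the sketch for runs shorter than the timer resolution.
-

```python
def peak_mflops(flop, elapsed_ms):
    """MFLOPS for `flop` operations executed in `elapsed_ms` milliseconds.

    Returns None if the elapsed time is too short to be measured.
    """
    if elapsed_ms <= 0:
        return None
    return flop / (elapsed_ms * 1000.0)


print(peak_mflops(104002, 100))      # a run that took 100 ms
```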
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- ; FILE: M4X4.ASM
- ;
- ; assemble with TASM /e M4X4 or MASM /e M4X4
-
- CODE SEGMENT BYTE PUBLIC 'CODE'
-
- ASSUME CS:CODE
-
- PUBLIC MUL_4x4
- PUBLIC IIT_MUL_4x4
-
-
- FSBP0 EQU DB 0DBh, 0E8h ; declare special IIT
- FSBP1 EQU DB 0DBh, 0EBh ; instructions
- FSBP2 EQU DB 0DBh, 0EAh
- F4X4 EQU DB 0DBh, 0F1h
-
-
- ;---------------------------------------------------------------------
- ;
- ; MUL_4x4 multiplies a four-by-four matrix by an array of
- ; four-dimensional vectors. This operation is needed for 3D
- ; transformations in graphics data processing. There are arrays for
- ; each component of a vector. Thus there is an array containing all
- ; the x components, another containing all the y components, and so
- ; on. Each component is an 8 byte IEEE floating point number. Two
- ; indices into the array of vectors are given. The first is the index
- ; of the vector that will be processed first, the second is the index
- ; of the vector processed last.
- ;
- ;---------------------------------------------------------------------
-
- MUL_4x4 PROC NEAR
-
- AddrX EQU DWORD PTR [BP+24] ; address of X component array
- AddrY EQU DWORD PTR [BP+20] ; address of Y component array
- AddrZ EQU DWORD PTR [BP+16] ; address of Z component array
- AddrW EQU DWORD PTR [BP+12] ; address of W component array
- AddrT EQU DWORD PTR [BP+8] ; addr. of 4x4 transform. mat.
- F EQU WORD PTR [BP+6] ; first vector to process
- K EQU WORD PTR [BP+4] ; last vector to process
- RetAddr EQU WORD PTR [BP+2] ; return address saved by call
- SavdBP EQU WORD PTR [BP+0] ; saved frame pointer
- SavdDS EQU WORD PTR [BP-2] ; caller's data segment
-
- PUSH BP ; save TURBO-Pascal frame pointer
- MOV BP, SP ; new frame pointer
- PUSH DS ; save TURBO-Pascal data segment
-
- MOV CX, K ; final index
- SUB CX, F ; final index - start index
- JNC $ok ; must not
- JMP $nothing ; be negative
- $ok: INC CX ; number of elements
-
- MOV SI, F ; init offset into arrays
- SHL SI, 1 ; each
- SHL SI, 1 ; element
- SHL SI, 1 ; has 8 bytes
-
- LDS DI, AddrT ; addr. of transformation mat.
- FLD QWORD PTR [DI] ; load a[0,0] = R7
- FLD QWORD PTR [DI+8] ; load a[0,1] = R6
-
- $mat_mul: LES BX, AddrX ; addr. of x component array
- FLD QWORD PTR ES:[BX+SI] ; load x[a] = R5
- LES BX, AddrY ; addr. of y component array
- FLD QWORD PTR ES:[BX+SI] ; load y[a] = R4
- LES BX, AddrZ ; addr. of z component array
- FLD QWORD PTR ES:[BX+SI] ; load z[a] = R3
- LES BX, AddrW ; addr. of w component array
- FLD QWORD PTR ES:[BX+SI] ; load w[a] = R2
-
- FLD ST(5) ; load a[0,0] = R1
- FMUL ST, ST(4) ; a[0,0] * x[a] = R1
- FLD ST(5) ; load a[0,1] = R0
- FMUL ST, ST(4) ; a[0,1] * y[a] = R0
- FADDP ST(1), ST ; a[0,0]*x[a]+a[0,1]*y[a]=R1
- FLD QWORD PTR [DI+16] ; load a[0,2] = R0
- FMUL ST, ST(3) ; a[0,2] * z[a] = R0
- FADDP ST(1), ST ; a[0,0]*x[a]...a[0,2]*z[a]=R1
- FLD QWORD PTR [DI+24] ; load a[0,3] = R0
- FMUL ST, ST(2) ; a[0,3] * w[a] = R0
- FADDP ST(1), ST ; a[0,0]*x[a]...a[0,3]*w[a]=R1
- LES BX, AddrX ; get address of x vector
- FSTP QWORD PTR ES:[BX+SI] ; write new x[a]
-
- FLD QWORD PTR [DI+32] ; load a[1,0] = R1
- FMUL ST, ST(4) ; a[1,0] * x[a] = R1
- FLD QWORD PTR [DI+40] ; load a[1,1] = R0
- FMUL ST, ST(4) ; a[1,1] * y[a] = R0
- FADDP ST(1), ST ; a[1,0]*x[a]+a[1,1]*y[a]=R1
- FLD QWORD PTR [DI+48] ; load a[1,2] = R0
- FMUL ST, ST(3) ; a[1,2] * z[a] = R0
- FADDP ST(1), ST ; a[1,0]*x[a]...a[1,2]*z[a]=R1
- FLD QWORD PTR [DI+56] ; load a[1,3] = R0
- FMUL ST, ST(2) ; a[1,3] * w[a] = R0
- FADDP ST(1), ST ; a[1,0]*x[a]...a[1,3]*w[a]=R1
- LES BX, AddrY ; get address of y vector
- FSTP QWORD PTR ES:[BX+SI] ; write new y[a]
-
- FLD QWORD PTR [DI+64] ; load a[2,0] = R1
- FMUL ST, ST(4) ; a[2,0] * x[a] = R1
- FLD QWORD PTR [DI+72] ; load a[2,1] = R0
- FMUL ST, ST(4) ; a[2,1] * y[a] = R0
- FADDP ST(1), ST ; a[2,0]*x[a]+a[2,1]*y[a]=R1
- FLD QWORD PTR [DI+80] ; load a[2,2] = R0
- FMUL ST, ST(3) ; a[2,2] * z[a] = R0
- FADDP ST(1), ST ; a[2,0]*x[a]...a[2,2]*z[a]=R1
- FLD QWORD PTR [DI+88] ; load a[2,3] = R0
- FMUL ST, ST(2) ; a[2,3] * w[a] = R0
- FADDP ST(1), ST ; a[2,0]*x[a]...a[2,3]*w[a]=R1
- LES BX, AddrZ ; get address of z vector
- FSTP QWORD PTR ES:[BX+SI] ; write new z[a]
-
- FLD QWORD PTR [DI+96] ; load a[3,0] = R1
- FMULP ST(4), ST ; a[3,0] * x[a] = R5
- FLD QWORD PTR [DI+104] ; load a[3,1] = R1
- FMULP ST(3), ST ; a[3,1] * y[a] = R4
- FLD QWORD PTR [DI+112] ; load a[3,2] = R1
- FMULP ST(2), ST ; a[3,2] * z[a] = R3
- FLD QWORD PTR [DI+120] ; load a[3,3] = R1
- FMULP ST(1), ST ; a[3,3] * w[a] = R2
- FADDP ST(1), ST ; a[3,3]*w[a]+a[3,2]*z[a]=R3
- FADDP ST(1), ST ; a[3,3]*w[a]...a[3,1]*y[a]=R4
- FADDP ST(1), ST ; a[3,3]*w[a]...a[3,0]*x[a]=R5
- LES BX, AddrW ; get address of w vector
- FSTP QWORD PTR ES:[BX+SI] ; write new w[a]
-
- ADD SI, 8 ; new offset into arrays
- DEC CX ; decrement element counter
- JZ $done ; no elements left, done
- JMP $mat_mul ; transform next vector
-
- $done: FSTP ST(0) ; clear
- FSTP ST(0) ; FPU stack
- $nothing: POP DS ; restore TP data segment
- POP BP ; restore TP frame pointer
- RET 24 ; pop parameters and return
-
- MUL_4X4 ENDP
-
-
- ;---------------------------------------------------------------------
- ;
- ; IIT_MUL_4x4 multiplies a four-by-four matrix by an array of
- ; four-dimensional vectors. This operation is needed for 3D
- ; transformations in graphics data processing. Each vector component
- ; is stored in its own array: one array holds all the x components,
- ; another all the y components, and so on. Each component is an
- ; 8-byte IEEE double-precision floating-point number. Two indices
- ; into the arrays are passed: the index of the first vector to
- ; process and the index of the last vector to process (inclusive).
- ; This subroutine uses the special instructions available only on
- ; IIT coprocessors (FSBP0, FSBP1, FSBP2, F4X4) to provide a fast
- ; matrix multiply, so make sure to call it only on IIT coprocessors.
- ;
- ;---------------------------------------------------------------------
-
- IIT_MUL_4x4 PROC NEAR
-
- AddrX EQU DWORD PTR [BP+24] ; address of X component array
- AddrY EQU DWORD PTR [BP+20] ; address of Y component array
- AddrZ EQU DWORD PTR [BP+16] ; address of Z component array
- AddrW EQU DWORD PTR [BP+12] ; address of W component array
- AddrT EQU DWORD PTR [BP+8] ; addr. of 4x4 transf. matrix
- F EQU WORD PTR [BP+6] ; first vector to process
- K EQU WORD PTR [BP+4] ; last vector to process
- RetAddr EQU WORD PTR [BP+2] ; return address saved by call
- SavdBP EQU WORD PTR [BP+0] ; saved frame pointer
- SavdDS EQU WORD PTR [BP-2] ; caller's data segment
- Ctrl87 EQU WORD PTR [BP-4] ; caller's 80x87 control word
-
- PUSH BP ; save TURBO-Pascal frame ptr
- MOV BP, SP ; new frame pointer
- PUSH DS ; save TURBO-Pascal data seg.
- SUB SP, 2 ; make local variable
- FSTCW [Ctrl87] ; save 80x87 ctrl word
- LES SI, AddrT ; ptr to transformation matrix
- FINIT ; initialize coprocessor
- FSBP2 ; set register bank 2
- FLD QWORD PTR ES:[SI] ; load a[0,0]
- FLD QWORD PTR ES:[SI+32] ; load a[1,0]
- FLD QWORD PTR ES:[SI+64] ; load a[2,0]
- FLD QWORD PTR ES:[SI+96] ; load a[3,0]
- FLD QWORD PTR ES:[SI+8] ; load a[0,1]
- FLD QWORD PTR ES:[SI+40] ; load a[1,1]
- FLD QWORD PTR ES:[SI+72] ; load a[2,1]
- FLD QWORD PTR ES:[SI+104] ; load a[3,1]
- FINIT ; initialize coprocessor
- FSBP1 ; set register bank 1
- FLD QWORD PTR ES:[SI+16] ; load a[0,2]
- FLD QWORD PTR ES:[SI+48] ; load a[1,2]
- FLD QWORD PTR ES:[SI+80] ; load a[2,2]
- FLD QWORD PTR ES:[SI+112] ; load a[3,2]
- FLD QWORD PTR ES:[SI+24] ; load a[0,3]
- FLD QWORD PTR ES:[SI+56] ; load a[1,3]
- FLD QWORD PTR ES:[SI+88] ; load a[2,3]
- FLD QWORD PTR ES:[SI+120] ; load a[3,3]
-
- ; transformation matrix loaded
-
- MOV AX, F ; index of first vector
- MOV DX, K ; index of last vector
-
- MOV BX, AX ; index 1st vector to process
- MOV CL, 3 ; component has 8 (2**3) bytes
- SHL BX, CL ; compute offset into arrays
-
- FINIT ; initialize coprocessor
- FSBP0 ; set register bank 0
-
- $mat_loop: LES SI, AddrW ; addr. of W component array
- FLD QWORD PTR ES:[SI+BX] ; W component current vector
- LES SI, AddrZ ; addr. of Z component array
- FLD QWORD PTR ES:[SI+BX] ; Z component current vector
- LES SI, AddrY ; addr. of Y component array
- FLD QWORD PTR ES:[SI+BX] ; Y component current vector
- LES SI, AddrX ; addr. of X component array
- FLD QWORD PTR ES:[SI+BX] ; X component current vector
- F4X4 ; mul 4x4 matrix by 4x1 vector
- INC AX ; index of next vector
- MOV DI, AX ; copy index of next vector
- SHL DI, CL ; offset of next vector into arrays
-
- FSTP QWORD PTR ES:[SI+BX] ; store X comp. of curr. vect.
- LES SI, AddrY ; address of Y component array
- FSTP QWORD PTR ES:[SI+BX] ; store Y comp. of curr. vect.
- LES SI, AddrZ ; address of Z component array
- FSTP QWORD PTR ES:[SI+BX] ; store Z comp. of curr. vect.
- LES SI, AddrW ; address of W component array
- FSTP QWORD PTR ES:[SI+BX] ; store W comp. of curr. vect.
-
- MOV BX, DI ; ofs nxt vect. in comp. arrays
- CMP AX, DX ; nxt vector past upper bound?
- JLE $mat_loop ; no, transform next vector
- FLDCW [Ctrl87] ; restore orig 80x87 ctrl word
-
- ADD SP, 2 ; get rid of local variable
- POP DS ; restore TP data segment
- POP BP ; restore TP frame pointer
- RET 24 ; pop parameters and return
- IIT_MUL_4x4 ENDP
-
- CODE ENDS
-
- END
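
The operation performed by MUL_4X4 and IIT_MUL_4x4 can be written down in a few lines of portable C, which may be handy as a cross-check when testing either routine. The following is only a sketch under stated assumptions: the name ref_mul_4x4 and its parameter list are hypothetical and do not appear in M4X4.ASM, and the matrix is assumed to be passed as 16 doubles stored row by row, as the byte offsets in the assembly code suggest.

```c
#include <assert.h>

/* Hypothetical reference routine, not part of M4X4.ASM.
   t holds the 4x4 matrix row by row (t[4*r + c] is a[r,c]), which
   matches the byte offsets 8*(4*r + c) used by the assembly code. */
void ref_mul_4x4(double *x, double *y, double *z, double *w,
                 const double *t, int first, int last)
{
    int i;

    assert(first >= 0);                 /* arrays are indexed from 0 */
    for (i = first; i <= last; i++) {   /* last is inclusive */
        /* Read all four inputs before any output overwrites them. */
        const double xi = x[i], yi = y[i], zi = z[i], wi = w[i];

        x[i] = t[0]*xi  + t[1]*yi  + t[2]*zi  + t[3]*wi;
        y[i] = t[4]*xi  + t[5]*yi  + t[6]*zi  + t[7]*wi;
        z[i] = t[8]*xi  + t[9]*yi  + t[10]*zi + t[11]*wi;
        w[i] = t[12]*xi + t[13]*yi + t[14]*zi + t[15]*wi;
    }
}
```

With the matrix entries set to 1 through 16 row by row and the vector (1, 1, 1, 1) at index 0, this sketch yields (10, 26, 42, 58). For arbitrary data, results from the coprocessor routines may differ from it in the last few bits, since the coprocessor may keep intermediate sums in its 80-bit registers and the assembly code sums the last row in a different order.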
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$N+,E+}
-
- PROGRAM Trnsform;
-
- USES Time;
-
- CONST VectorLen = 8190;
-
- TYPE Vector = ARRAY [0..VectorLen] OF DOUBLE;
- VectorPtr = ^Vector;
- Mat4 = ARRAY [1..4, 1..4] OF DOUBLE;
-
- VAR X: VectorPtr;
- Y: VectorPtr;
- Z: VectorPtr;
- W: VectorPtr;
- T: Mat4;
- K: INTEGER;
- L: INTEGER;
- First: INTEGER;
- Last: INTEGER;
- Start: LONGINT;
- Elapsed: LONGINT;
-
- PROCEDURE MUL_4X4 (X, Y, Z, W: VectorPtr;
- VAR T: Mat4; First, Last: INTEGER); EXTERNAL;
- PROCEDURE IIT_MUL_4X4 (X, Y, Z, W: VectorPtr;
- VAR T: Mat4; First, Last: INTEGER); EXTERNAL;
-
- {$L M4X4.OBJ}
-
- BEGIN
- WriteLn ('Test8087 = ', Test8087);
- New (X);
- New (Y);
- New (Z);
- New (W);
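- { fill vectors 1..VectorLen with random components }
- FOR L := 1 TO VectorLen DO BEGIN
- X^ [L] := Random;
- Y^ [L] := Random;
- Z^ [L] := Random;
- W^ [L] := Random;
- END;
- { vector 0 is set to (1, 1, 1, 1) }
- X^ [0] := 1;
- Y^ [0] := 1;
- Z^ [0] := 1;
- W^ [0] := 1;
- { fill the matrix with the values 1..16, row by row }
- FOR K := 1 TO 4 DO BEGIN
- FOR L := 1 TO 4 DO BEGIN
- T [K, L] := (K-1)*4 + L;
- END;
- END;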
- First := 0;
- Last := VectorLen;
- Start := Clock;
- MUL_4X4 (X, Y, Z, W, T, First, Last);
- { On an IIT coprocessor, call this instead to time the IIT version: }
- { IIT_MUL_4X4 (X, Y, Z, W, T, First, Last); }
- Elapsed := Clock - Start;
- WriteLn ('Number of vectors: ', Last-First+1);
- WriteLn ('Time: ', Elapsed, ' ms');
- { 28 operations per vector: 16 multiplications + 12 additions }
- WriteLn ('Equivalent to ', (28.0*(Last-First+1)/1e6)/
- (Elapsed*1e-3):0:4, ' MFLOPS');
- WriteLn;
- WriteLn ('Last vector:');
- WriteLn;
- WriteLn (X^[Last]);
- WriteLn (Y^[Last]);
- WriteLn (Z^[Last]);
- WriteLn (W^[Last]);
- END.
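
The MFLOPS figure printed by TRNSFORM follows from counting 28 floating-point operations per transformed vector: each of the four result components needs 4 multiplications and 3 additions. The arithmetic can be sketched in C as follows. The names FLOPS_PER_VECTOR and mflops are hypothetical and serve only to illustrate the calculation; the elapsed time is assumed to be given in milliseconds, as in the program's output.

```c
#include <assert.h>

/* Floating-point operations per transformed vector:
   4 rows x 4 multiplications + 4 rows x 3 additions = 28. */
#define FLOPS_PER_VECTOR 28.0

/* Hypothetical helper mirroring the expression used in TRNSFORM. */
double mflops(long n_vectors, long elapsed_ms)
{
    assert(elapsed_ms > 0);   /* a zero reading would divide by zero */
    return (FLOPS_PER_VECTOR * n_vectors / 1e6) / (elapsed_ms * 1e-3);
}
```

As a purely arithmetical illustration, 8191 vectors transformed in a supposed 1000 ms would work out to about 0.2293 MFLOPS. The assertion makes explicit that the formula breaks down when the timer reads zero, which can happen if the run is shorter than the timer resolution.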