Text File | 1993-03-01 | 224.5 KB | 3,715 lines

WHAT YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS
***************************************************************

This document has been created to provide the net.community with some detailed information about mathematical coprocessors for the Intel 80x86 CPU family. It may also help to answer some of the FAQs (frequently asked questions) about this topic. The focus of this document is on 387 compatible chips, but there is also some information on the other chips in the 80x87 family and the Weitek coprocessors.

Care was taken to make the information included as accurate as possible. If you think you have discovered erroneous information in this text, or think that a certain detail needs to be clarified, or want to suggest additions to this text, feel free to contact me at:

S_JUFFA@IRAVCL.IRA.UKA.DE

or at my snail mail address:

Norbert Juffa
Wielandtstr. 14
7500 Karlsruhe 1
Germany

CONTENTS of this document

 1) What are math coprocessors?
 2) What applications benefit from using a math coprocessor
 3) Installing a math coprocessor
 4) Description of available math coprocessors, special features, available speeds, packaging, power consumption
 5) Price information
 6) How do math coprocessors work
 7) Performance comparison of math coprocessors
 8) Test for IEEE-754 conformance and accuracy of transcendental functions for different math coprocessors
 9) References (literature)
10) Addresses of manufacturers of math coprocessors
11) Appendix A: Test programs for partial compatibility checks
12) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP

What are math coprocessors?

A coprocessor in the traditional sense is a processor that extends the capabilities of a CPU in a transparent manner. This means that from the programmer's view the CPU and coprocessor together look like one machine. The 80x87 math coprocessors are typical examples of such coprocessors.
The 80x86 CPUs (with the exception of the 80486, which has a built-in 'coprocessor') can only handle 8, 16, or 32 bit integers as their primary data types. However, many applications require the use of floating-point numbers. Simply put, use of floating-point numbers enables one to express not only integers, but also fractional values over a wide range. The most common application of floating-point numbers is in scientific applications, where very small (e.g. Planck's constant) and very large numbers (e.g. speed of light) have to be expressed. But floating-point numbers are also useful for business applications such as computing interest. Since the 80x86 CPUs do not support floating-point numbers or operations on them directly, such operations have to be programmed using the CPU's integer capabilities. This results in slow computations when floating-point numbers are used.

This is where the 80x87 coprocessors come in. Adding an 80x87 to an 80x86 based system augments the CPU architecture with eight floating-point registers, five additional data types and over 70 additional mnemonics. This greatly enhances the system's capability to do floating-point computations, as the coprocessor is specifically designed to handle floating-point numbers efficiently.

Like most things in life, floating-point arithmetic has been standardized. The relevant standard, to which I will refer quite often in this document, is the IEEE-754 Standard for Binary Floating-Point Arithmetic [10,11]. The standard specifies numeric formats, value sets and how the basic arithmetic (+, -, *, /, sqrt, remainder) has to work. All the coprocessors covered in this document claim full or at least partial compliance with this standard.

When browsing the literature for information on math coprocessors, you will also encounter quite a few acronyms that refer to them: MCP (Math CoProcessor), NDP (Numerical Data Processor), NPX (Numerical Processor eXtension), FPU (Floating Point Unit).
The latter usually refers to the 'built-in coprocessor' of the i486.

The only data type the 80x87 coprocessors (and the 80486 floating point unit, or FPU) can hold in their registers is an 80-bit long floating-point number. This data type (called temporary real or double extended precision) can represent numbers which range in size between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932 including denormal numbers), where the '^' denotes the power operator. For those familiar with floating-point formats, this format has 64 mantissa bits, 15 exponent bits and 1 sign bit for the total of 80 bits. This format provides a precision of about 19 decimal places.

The 80x87 can handle additional data types that are converted to the internal format when loaded into the coprocessor and converted back when stored. These include 16 bit, 32 bit, and 64 bit integers as well as an 18 digit BCD (binary coded decimal) number occupying 10 bytes and two additional floating-point types. The short real data type, also called single precision, has 32 bits that split into 23 mantissa bits, 8 exponent bits and a sign bit. This format provides a precision of about 6-7 decimal places and can represent numbers between 1.17*10^-38 and 3.40*10^38 (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long real, or double precision, data type has 64 bits, consisting of 52 mantissa bits, 11 exponent bits and the sign bit. It provides 15-16 decimal digits of precision and can handle numbers from 2.22*10^-308 to 1.79*10^308 (4.94*10^-324 to 1.79*10^308 including denormal numbers).

In addition to loading and storing the above mentioned operand types, the 80x87 coprocessors can perform all the basic arithmetic operations on floating-point numbers. Besides 'knowing' how to add, subtract, multiply and divide they can also compare floating-point numbers, change the sign, take the square root or absolute value, compute the remainder and compute some of the transcendental functions, like the logarithm.
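The ranges and precisions quoted for the three floating-point formats follow from their bit counts. The following Python sketch recomputes them. It assumes IEEE-style biased exponents with the all-zeros and all-ones exponent codes reserved, and the function names are illustrative only, not taken from any library. Note that `sig_bits` counts the leading integer bit, so it is 24 and 53 for single and double precision (where that bit is implicit and not stored) and 64 for double extended precision (where it is stored).

```python
import math

LOG10_2 = math.log10(2.0)

def sci(log10_value):
    """Return (mantissa, exponent) with value = mantissa * 10**exponent."""
    exponent = math.floor(log10_value)
    return 10.0 ** (log10_value - exponent), exponent

def format_limits(exp_bits, sig_bits):
    """Approximate decimal limits of a binary floating-point format.

    exp_bits: width of the exponent field
    sig_bits: significand bits, including the leading integer bit
    The work is done on log10 values because the double extended
    range does not fit into a Python float.
    """
    bias = 2 ** (exp_bits - 1) - 1
    emin, emax = 1 - bias, bias
    return {
        "min_normal": sci(emin * LOG10_2),
        "min_denormal": sci((emin - (sig_bits - 1)) * LOG10_2),
        "max": sci(emax * LOG10_2 + math.log10(2.0 - 2.0 ** -(sig_bits - 1))),
        "decimal_digits": sig_bits * LOG10_2,
    }

if __name__ == "__main__":
    formats = (("short real", 8, 24), ("long real", 11, 53), ("temporary real", 15, 64))
    for name, exp_bits, sig_bits in formats:
        lim = format_limits(exp_bits, sig_bits)
        for key in ("min_denormal", "min_normal", "max"):
            mantissa, exponent = lim[key]
            print(f"{name:15s} {key:13s} {mantissa:.2f}*10^{exponent}")
        print(f"{name:15s} decimal digits {lim['decimal_digits']:.1f}")
```

The printed values are rounded to two decimals, so they can differ in the last digit from figures in the text that were truncated instead.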
The eight registers in the 80x87 are organized in a stack-like manner which takes some time getting used to if one programs the coprocessor directly in assembler. However, nowadays the compilers or interpreters for most high level languages (HLL) can give the programmer access to the coprocessor's data types and use its instructions, so there is not much need to worry about the rather unusual architecture of the 80x87.

Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors in that they do not transparently extend the CPU architecture. Rather they could be described as special memory mapped IO-devices. Since the term coprocessor has been traditionally used for these chips, they are also called by that term in this document.

The architecture of the Weitek chips differs significantly from the 80x87. The Weitek's register file consists of 31 32-bit registers, each one capable of holding an IEEE single precision number. Pairs of consecutive single precision registers can also be used as 64-bit IEEE double precision registers. Thus there are 15 double precision registers. The Weitek register file has the standard organization known from other register files like those in the 80386, not the special stack-like organization of the 80x87 coprocessors.

The Weitek coprocessors have been tuned for maximum performance. Therefore, only a small instruction set has been implemented, but each instruction executes at a very high speed, usually in only a few clock cycles. Instructions available are load/store, add, subtract, subtract reverse, multiply, multiply and negate, multiply and accumulate, multiply and take absolute value, divide reverse, negate, absolute value, compare/test, convert fix/float, and square root. Note that the Weitek Abacus does not support a double extended format, has no built-in transcendental functions, and does not support denormals.
The resources required to implement such features have instead been devoted to implementing the basic arithmetic operations as fast as possible. While the 80x87 coprocessors perform all internal calculations in double extended precision and therefore have about the same performance for single and double precision calculations, the Weitek features explicit single and double precision operations. For applications that require only single precision the Weitek provides additional performance that way, as single precision operations are about twice as fast as their double precision counterparts. Since the Weitek Abacus has more registers than the 80x87 coprocessors, values can be kept in registers more often and have to be loaded from memory less frequently. This also leads to a performance gain.

To the CPU, the Weitek Abacus looks like a 64 kB block of memory starting at physical address 0C0000000h. Every address in this range corresponds to a coprocessor instruction. Accessing a specified memory location within this block with the 386/486's MOV instruction causes the corresponding instruction to be executed. The instructions have been assigned to memory locations in such a way that loads to consecutive coprocessor registers can make use of the 386/486 MOVS string instruction. The memory mapped interface of the Weitek coprocessors is much faster than the IO-oriented protocol that is used to couple the CPU to the 80287 and 80387 coprocessors.

The Weitek's starting address of 0C0000000h is only a physical address. The Weitek's memory block can be assigned to any logical address using the MMU (memory management unit) in the 386/486's protected and virtual modes. This also means that the Weitek Abacus 3167 and 4167 can *not* be used in the real mode of those processors, since the physical start address of the Weitek coprocessors is not within the 1 MByte address range and the MMU is inoperable in real mode.
However, DOS programs can make use of the Weitek Abacus by using a DOS extender or a memory manager like EMM386 that run in protected/virtual mode themselves and can therefore map the Weitek's memory block to any desired location in the 1 MByte address range. Typically the FS segment register is then set up to point to the Weitek's memory block. The Weitek Abacus 3167 and 4167 are also supported by the UNIX operating system [33].

What applications will profit by using a math coprocessor?

According to the Intel 387DX User's Guide, there are more than 2100 commercial programs that can make use of a 387 compatible coprocessor. Every program that uses floating-point arithmetic somewhere and supports an 80x87 coprocessor can gain speed by installing a coprocessor. However, the speedup will vary from program to program and even within the same program, depending on how computation intensive the program or operation within the program is. Typical applications that benefit from the use of an 80x87 coprocessor are:

- Business graphics programs, such as Arts&Letters, Freedom of Press, and Freelance
- Spreadsheet programs like Lotus 1-2-3, Excel, Quattro, and Wingz
- CAD programs such as AutoCAD, VersaCAD, and GenericCAD
- Database programs such as dBase IV, FoxBase, and Paradox
- Math and Science programs such as Mathematica, TKSolver, SPSS/PC, and Statgraphics

Note that for spreadsheets and databases, a coprocessor only helps if some kind of floating-point computation is performed. This is true more often for spreadsheets than for databases. Also note that the speed of many programs depends quite heavily on the speed of the graphics adaptor (CAD) or the disk performance (databases), so the computational performance is only a (small) part of the total performance of the application. There are some programs that won't run without a coprocessor, among them AutoCAD R10 and later and Mathematica.
GUIs (graphical user interfaces) such as Windows do *not* gain additional speed from using a *mathematical* coprocessor, since their graphics operations only use integer arithmetic. They do benefit, though, from a graphics board with a graphical 'coprocessor' that speeds up certain common operations such as BitBlt or line drawing. However, applications running under Windows may take advantage of a math coprocessor, e.g. Excel.

While support for 80x87 coprocessors is very common in application programs, the Weitek Abacus coprocessors do not enjoy such widespread support. Due to their high price, only a few high-end PCs have been equipped with Weitek coprocessors. Therefore most of the programs that support these coprocessors are also high-end products like AutoCAD and Versacad-386.

Installing a math coprocessor

Usually, installing a coprocessor doesn't pose much of a problem, as every coprocessor comes with installation instructions and a diagnostic disk that lets you check for correct operation once the coprocessor has been installed. In addition, the user manuals of most computers have a section on coprocessor installation.

1) Make sure to get the right coprocessor for your system. An 8087 works together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or compatible works together with an 80286 CPU. There are also some old 386 motherboards that accept an 80287 coprocessor, but they usually also provide a socket for the 387, and I recommend getting a 387 for use with these systems. An 80387, 387DX or compatible coprocessor is for 386 based systems, as is the Intel RapidCAD. 387 coprocessors also work together with Cyrix' 486DLC CPU, which despite its name does not include an FPU. Similarly, the 387SX or compatible coprocessors go into systems whose CPU is a 386SX or Cyrix 486SLC. The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC socket in your system. Some computers, such as IBM's PS/2s, don't have this socket.
The Weitek Abacus 4167 works together with the 486 and requires the appropriate 142-pin socket to be present.

Always install a coprocessor that is rated at the same speed as the CPU. That is, for a 40 MHz 386 system using an AMD Am386-40, install a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, IIT 3C87-40, or ULSI 83C87-40. Running a coprocessor above its specified frequency rating may cause it to produce false results, which you might fail to recognize as such. I have personally experienced this problem with a Cyrix 83D87-33 that I tried to push to 40 MHz. It passed all the diagnostic benchmarks on the Cyrix diagnostic disk and the tests of some commercial system test programs. However, I found it to fail the Whetstone and Linpack benchmarks, which include accuracy checks. So although there is usually no problem with overheating when pushing a coprocessor over the specified maximum frequency rating, be warned that operation of a coprocessor above the maximum ratings stated by the manufacturer makes operation unreliable.

Some 386 boards allow the coprocessor to be clocked differently than the CPU. This is called asynchronous operation and allows you to run the coprocessor at 33 MHz while the CPU runs at 40 MHz, for example. Please note that only the Intel 80387 and 387DX support asynchronous operation. The 387 'clones' from Cyrix, IIT and ULSI always run at the full speed of the CPU, even if you have set up your motherboard for asynchronous operation.

2) Once you've got the correct coprocessor for your system you can start the actual installation process:

- turn off the computer's power switch and unplug the power cord from the wall outlet
- remove the cover of your computer
- locate the math coprocessor socket. This socket is located right next to the CPU, which can be identified by the printing on top of the chip. The CPU usually is one of the biggest chips on the board.
  The 8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on each of the longer sides. The 387SX PLCC socket is a square socket that has 17 vertical connector strips on the 'wall' of each side. The 387 PGA socket is square and has two rows of pin holes on each side. The EMC socket is similar but has three rows of holes on each side. The PGA socket for the Weitek 4167 is also square with three rows of holes on each side. If the CPU and coprocessor socket is on a separate card rather than on the motherboard (typical for modular systems), you have to remove the card and place it on a flat and hard surface free of static electricity. If you can't find the math coprocessor socket, consult your owner's manual or your computer dealer. If you want to install the Intel RapidCAD in a 386 system, you will have to remove the 386 CPU before starting to install the two RapidCAD chips. Intel provides an easy to use chip extractor and a storage box for the 386 chip for this purpose. Just follow the instructions in the RapidCAD's installation manual.
- Be sure you are properly grounded before you remove the coprocessor from its antistatic box. Static electricity can damage the coprocessor. Make sure you do not touch the pins.
- Check that all pins are straight and not bent. If you find bent pins, carefully straighten them with needle-nose pliers or tweezers.
- Match the coprocessor's orientation with the orientation of the socket. 8087 and 287 coprocessors have a notch on one of the shorter sides of their rectangular DIL package that should be matched with the notch of the coprocessor socket. Usually the 286 CPU and the 287 coprocessor are placed alongside each other and both have the same orientation, that is, their respective notches point in the same direction. 387SX coprocessors feature a white dot or similar mark that matches with some sort of marking on the socket. 387 coprocessors have a beveled corner that is also marked with a white dot or similar marking.
  This should be matched with the beveled or otherwise marked corner of the socket. If you install a 387 coprocessor in an EMC socket, leave one row of holes free on each side. Correct orientation of the coprocessor is absolutely essential, because if you insert it the wrong way it may be damaged. Once you have found the correct orientation, make sure all pins are correctly aligned with their respective holes. Press firmly and evenly on the chip. You may have to press hard to seat the coprocessor all the way. Make sure your motherboard does not bend more than slightly under the insertion pressure. Otherwise it may develop cracks that could damage the signal lines on the board. For 8087, 287, and 387 coprocessors it is normal that the coprocessor does not go all the way in, but that about one millimeter (1/25 inch) of space is left between the socket and the bottom of the coprocessor chip. This enables the insertion of an extraction device should it become necessary to remove the coprocessor. Note that the construction of the 387SX's PLCC socket makes it next to impossible to remove the coprocessor once fully inserted, as the top of the chip is then level with the socket's 'walls'.

3) Check your computer's manual for the jumpers and/or switches you may have to set for coprocessor operation. Put the cover back on the system unit and reconnect the power. Turn on your computer. Depending on your BIOS, you may have to run the setup or configuration program to register the coprocessor. Use the diagnostic disk included with your coprocessor to check for correct operation of your coprocessor.

Coprocessor emulations

In the absence of a coprocessor, floating-point calculations are most often performed by a software package that simulates the operations of the coprocessor. Such a program is called a coprocessor emulator.
Simulating the coprocessor has the advantage that identical code can be generated for the coprocessor and the emulator, so that it is possible to write programs that run both on systems with and on systems without a coprocessor. Whether the program is to use the coprocessor or the emulator can then be decided at run-time by checking if a math coprocessor is present in the system.

Two approaches to interface an 80x87 emulator to programs are common. While the first method works with all 80x86 processors, the second only works from the 80286 on.

The first method makes use of the fact that all coprocessor instructions start with the same five bit pattern 11011. Thus the first byte of a coprocessor instruction will be in the range D8-DF hexadecimal. In addition, coprocessor instructions usually are preceded by a WAIT instruction (opcode 9Bh) which is one byte long (the reason for doing this is described in a later chapter on the operation of the 80x87). One common approach is to replace the WAIT instruction and the first byte of the coprocessor instruction with one of eight interrupts; the remaining bytes of the coprocessor instruction are left unchanged. Interrupts 34 to 3B hexadecimal are used for this emulation technique. Note that the sequences 9B D8 .. 9B DF can be easily converted to the interrupt instructions CD 34 .. CD 3B by simple addition and subtraction of constants. The compiler or assembler produces code that contains the appropriate interrupt calls instead of the coprocessor instructions. If a coprocessor is detected at run-time, the emulator interrupts point to a short routine that converts the interrupt calls back to coprocessor instructions (self modifying code). If no coprocessor is found, the interrupts point to an emulation package which examines the byte(s) following the interrupt instruction to determine what operation to perform. The method described is used by the compilers from Microsoft and Borland, for example.
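The byte conversion used by the interrupt method can be sketched in a few lines of Python. This is only an illustration of the arithmetic on the two leading bytes; the function names are made up for this example, and it does not reproduce how any particular compiler, linker or emulator actually performs the patching.

```python
WAIT = 0x9B                        # opcode of the WAIT instruction
INT = 0xCD                         # opcode of the INT n instruction
ESC_FIRST, ESC_LAST = 0xD8, 0xDF   # possible first bytes of a coprocessor instruction
INT_FIRST, INT_LAST = 0x34, 0x3B   # emulator interrupts matching D8h..DFh

def to_emulator_call(code: bytes) -> bytes:
    """Rewrite a leading 'WAIT, coprocessor opcode' pair as 'INT 34h..3Bh'."""
    if len(code) < 2 or code[0] != WAIT or not ESC_FIRST <= code[1] <= ESC_LAST:
        raise ValueError("not a WAIT-prefixed coprocessor instruction")
    # 9B -> CD, D8..DF -> 34..3B; the remaining bytes stay unchanged
    return bytes([INT, code[1] - ESC_FIRST + INT_FIRST]) + code[2:]

def to_coprocessor_instruction(code: bytes) -> bytes:
    """Inverse conversion, as done at run-time when a coprocessor is present."""
    if len(code) < 2 or code[0] != INT or not INT_FIRST <= code[1] <= INT_LAST:
        raise ValueError("not an emulator interrupt call")
    return bytes([WAIT, code[1] - INT_FIRST + ESC_FIRST]) + code[2:]

if __name__ == "__main__":
    fadd = bytes.fromhex("9bd8c0")   # WAIT, then FADD ST, ST(0) encoded as D8 C0
    patched = to_emulator_call(fadd)
    print(patched.hex(" "), "<->", to_coprocessor_instruction(patched).hex(" "))
```

Because both conversions touch only the first two bytes and keep the instruction length the same, the code can be patched in place.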
It works with every 80x86 CPU from the 8086/8088 on.

The second method to interface an emulator is only available on 286 and 386 machines. If the emulation bit in the machine status word of these processors is set, the processors will generate an interrupt 7 whenever a coprocessor instruction is encountered. The interrupt vector then points to an emulation package that decodes the instruction and performs the desired operation. This approach has the advantage that the emulator doesn't have to be included in the program code, but can be loaded as a TSR or device driver once and then used by every program that requires a coprocessor. Emulation via interrupt 7 is transparent, which means that programs containing coprocessor instructions execute just as if a coprocessor were present, only slower. This approach is taken by the public domain EM87 emulator and the commercial Franke387 emulator, for example. Even programs that require a coprocessor to run, like AutoCAD, are 'fooled' into believing that a coprocessor is present by emulators using INT 7.

The size of the emulator used by TP 6.0 is about 9.5 kB, EM87 occupies about 15.8 kB as a TSR, and Franke387 uses about 13.4 kB as a device driver. Note that Franke387 and especially EM87 model a real coprocessor much more closely than Turbo Pascal's emulator does. In particular, EM87 supports denormal numbers, precision control, and rounding control. The emulator in TP 6.0 does not implement these features. The version of Franke387 tested (V2.4) supports denormals in single and double precision, but not double extended precision. It supports precision control, but not rounding control. Intel's E80287 is supposed to be a 100% exact emulation of the 80287 coprocessor [44]. Generally, the more closely a real coprocessor is modelled by the emulator, the slower the emulator runs and the larger its code is.

Relative execution times of coprocessor vs.
software emulators for particular coprocessor instructions

                   Intel 387DX    TP 6.0 Emulator    EM87 Emulator

FADD ST, ST(0)          1               26                104
FDIV [DWord]            1               22                136
FXAM                    1               10                 73
FYL2X                   1               33                102
FPATAN                  1               36                110
F2XM1                   1               38                110

The following table is an excerpt from [44]:

                   Intel 80287    Intel E80287 Emulator

FADD ST, ST(0)          1                42
FDIV [DWord]            1               266
FXAM                    1               139
FYL2X                   1                99
FPATAN                  1               153
F2XM1                   1                41

The following has been adapted from [43] and merged with my own data:

                   Intel 8087     TP 6.0 Emul. (8086)    Intel Emul. (8086)

FADD ST, ST(0)          1                20                     94
FDIV [DWord]            1                22                     82
FPTAN                   1                18                    144
F2XM1                   1                 6                    171
FSQRT                   1                44                    544

One of the reasons emulators are so slow is that they are often designed to run with every CPU from the 8086/8088 on. This is the case with the emulators built into the compiler libraries of the Turbo Pascal 6.0 (also used by Turbo C/C++) and Microsoft C 6.0 compilers (probably also used in other Microsoft products) and is also true for the EM87 emulator in the public domain. By using code that can run on an 8086/8088, these emulators forego the speed advantage offered by the additional instructions and architectural enhancements (such as 32-bit registers) of the more advanced Intel 80x86 processors. A notable exception is the Franke387 emulator, a commercial emulator that is also sold as shareware. It uses 386 specific 32-bit code and only runs on 386/386SX computers.

Besides being slow, coprocessor emulators have other drawbacks compared with real coprocessors. Most of the emulators do not support the additional instructions that the 387 compatible coprocessors offer over the 80287. Often, some of the low-level stack-manipulating instructions like FDECSTP are not emulated. The coprocessor status register is not emulated or only partially emulated. Some emulators do not conform to the IEEE-754 standard in their implementation of the basic arithmetic functions, while the coprocessors do.
Also, they sometimes lack the support for denormals (a special class of floating-point numbers) although it is required by the standard. Not all the 80x87 emulators support rounding control (a feature required by IEEE-754) and precision control (a feature of the 80x87 coprocessor). Most of the omissions are aimed at making the emulator faster and smaller.

Because of the shortcomings of coprocessor emulators, a real coprocessor is a must for anybody planning to do some serious computations. At today's prices, this shouldn't pose much of a problem to anybody.

Available coprocessors, CPU+FPU as of 08-10-92:

Intel 8087 [43] was the first coprocessor that Intel brought out for the 80x86 family. It was introduced in 1980 and therefore does not have full compatibility with the IEEE-754 standard for floating-point arithmetic, which was finally released in 1985. It complements the 8088 and 8086 CPUs and can also be interfaced to the 80188 and 80186 processors. It comes in a 40 pin CERDIP (ceramic dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10 MHz (8087-1) versions. The 8087 is implemented using NMOS. Power consumption is rated at max. 2400 mW [42]. A neat trick to enhance the processing power of the 8087 for computations that use only the basic arithmetic operations (+, -, *, /) and do not require high precision is to set the precision control to single precision. This gives one a performance increase of up to 20%. For details about programming the precision control, see program PCtrl in appendix A.

Intel 80187 is a rather new coprocessor designed to support the 80C186 embedded controller. It was introduced in 1989 and implements the complete 80387 instruction set. It is available in a 40 pin CERDIP (ceramic dual inline package) and a 44 pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation. Power consumption is rated at max. 675 mW for the 12.5 MHz version and max. 780 mW for the 16 MHz version [37].
Intel 80287 [44] is the original Intel coprocessor for the 80286 and was introduced in 1983. It uses the same execution unit as the 8087 and therefore has the same speed (sometimes slower due to additional overhead in CPU coprocessor communication). Like the 8087, it does not provide full compatibility with the IEEE-754 floating-point standard released in 1985. It was manufactured in NMOS technology. There are 6 MHz, 8 MHz, and 10 MHz versions. The chip comes in a 40 pin CERDIP (ceramic dual inline package). Power consumption can be estimated to be the same as that for the 8087, which is max. 2400 mW. The 80287 has been replaced in the Intel 80x87 family with its successor, the Intel 287XL, which was introduced in 1990. The 287XL is done in CMOS. It is based on the 387 core and is therefore much faster than the 80287. There may still be a few of the old 80287 chips on the market though.

Intel 80287XL is the second generation 287 introduced by Intel in 1990. Since it is based on the 387 core, it features full IEEE-754 compatibility and faster execution of coprocessor instructions. Intel claims about 50% faster operation than the 80287 for typical benchmark tests such as Whetstone [45]. Comparison with benchmark results for the AMD 80C287, which is identical to the Intel 80287, supports this claim [1]. The Intel 287XL performed 66% faster than the AMD 80C287 on the fractal benchmark and 66% faster on the Whetstone benchmark in these tests. Whetstone results from [46] show the Intel 287XL at 12.5 MHz to perform 552 kWhets/sec as opposed to 289 kWhets/sec for the AMD 80C287, a 91% performance increase. A benchmark using the MathPak program showed the Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0 sec.) [26]. Since the 287XL has all the additional instructions and enhancements of a 387, most software automatically identifies it as an 80387 compatible coprocessor and makes use of the extra features available, like the FSIN and FCOS instructions.
The 287XL is done in CMOS and therefore uses less power than the older 80287, which was done in NMOS. The 287XL is rated for speeds of up to 12.5 MHz. At 12.5 MHz, the power consumption is rated at max. 675 mW, about 1/4 of the 80287 power consumption. The 287XL comes in either a 40 pin CERDIP (ceramic dual inline package) or a 44 pin PLCC (plastic leaded chip carrier). The latter version is called the 287XLT and is intended mainly for laptop use.

AMD 80C287 is an exact clone of the old Intel 80287 that was brought to market by AMD in 1989. It contains the original microcode of the 80287 and is therefore 100% compatible with this chip. However, as the name indicates, the 80C287 is manufactured in CMOS and therefore uses less power than an equivalent Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW, slightly less than that of the Intel 80287XL [27]. There is also another version called AMD 80EC287 that uses an 'intelligent' power save feature to reduce the power consumption below 80C287 levels. Tests at 10.7 MHz show typical power consumption for the 80EC287 to be 30 mW, compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and 1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally suited for low power laptop systems. The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. I have only seen it being offered in 10 MHz and 12 MHz versions though. At about US$ 50, it is the cheapest coprocessor available. Note that it provides less performance than the newer Intel 287XL (see above for details). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs (dual inline package) and as 44 pin PLCC (plastic leaded chip carrier). Due to recent legal battles with Intel over the right to use the 287 microcode, which AMD lost, AMD may have to discontinue this product (disclaimer: I am not a legal expert).

Cyrix 82S87 was developed from the Cyrix 83D87, Cyrix' 387 'clone', and has been available since 1991.
It implements the full 387 instruction set. It totally complies with the IEEE-754 standard for floating-point arithmetic and features nearly total compatibility with Intel's coprocessors. It implements the transcendental functions with the same degree of accuracy and the superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the fastest [1] and most accurate 287 compatible coprocessor available. Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5 MHz system, while the Intel 287XL performs only 552 kWhets/sec. The 82S87 is a fully static CMOS design with very low power requirements that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the 82S87 to consume about the same amount of power as the AMD 80C287 (see above). The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded chip carrier) compatible with the pinout of the Intel 287XLT and ideally suited for laptop use.

IIT 2C87 was the first 287 clone available. It was introduced to the market in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87 implements the full 387 instruction set [38]. Tests I ran on the 3C87 seem to indicate that it is not fully compatible with the IEEE-754 standard for floating-point arithmetic (see below for details), so it can be assumed that the 2C87 also fails these tests, as it presumably uses the same core as the 3C87.

The IIT 2C87 provides extra functions not available on any other 287 chip [38]. It has 24 user accessible floating-point registers organized into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2) allow switching from one bank to another. Transfers between registers in different banks are not supported however, so this feature by itself is of limited usefulness. Also there seems to be only one status register (containing the stack top pointer), so it has to be manually loaded and stored when switching between banks with a different number of registers in use [40].
The register banks' main purpose is to aid the fourth additional instruction the 2C87 has (F4X4), which does a full multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D graphics applications [39]. The built-in matrix multiply speeds this operation up by a factor of 6 to 8 compared with a programmed solution according to the manufacturer [38]. Tests show the speed-up to be indeed in this range [40]. For the 3C87, I measured the execution time of F4X4 to be about 280 clock cycles; the execution time on the 2C87 should be somewhat longer. I estimate it to be around 310 clock cycles due to the higher CPU-NDP communication overhead in instruction execution in 286/287 systems (~45-50 clock cycles) compared with 386/387 systems (~16-20 clock cycles). As useful as the F4X4 instruction may seem, there are only very few applications that make use of this feature if an IIT coprocessor is detected at run time, among them Schroff Development's Silver Screen and Evolution Computing's Fast-CAD 3-D [25]. The 2C87 is available for speeds of up to 20 MHz. It is implemented in an advanced CMOS process and therefore has a low power consumption of typically about 500 mW [38].

Intel 387 was the first generation of coprocessors for the Intel 386. It was introduced in 1986, about one year after introduction of the 80386. Early 386 systems were therefore equipped with both an 80287 and an 80387 socket. The 80386 works together with the 80287, but the numerical performance is hardly adequate for such a system. The 80387 has since been superseded by the Intel 387DX introduced by a quiet change in 1990. You might find it when acquiring an old 386 machine, though. The 80387 is about 20% slower than the newer 387DX (see the paragraph below for detailed information). Like the other 387 coprocessors, the 80387 is packaged in a 68-pin ceramic PGA. The Intel 80387 is manufactured using Intel's older 1.5 micron CHMOS III technology that has moderate power requirements.
Power consumption at 16 MHz is max. 1250 mW (750 mW typical), at 20 MHz it is max. 1550 mW (950 mW typical), and at 25 MHz it is max. 1950 mW (1250 mW typical) [60].

Intel 387DX is the second generation Intel 387 that was quietly introduced in 1989. This version is done in a more advanced CMOS process than the 80387, which enables the coprocessor to run at a maximum frequency of 33 MHz, while the 80387 had a maximum frequency of 25 MHz. The 387DX is about 20% faster than the 80387 on average at the same clock frequency. For a 386/387 system operating at 29 MHz, the Whetstone benchmark compiled with the highly optimizing Metaware High-C V1.6 runs at 2377 kWhetstones/sec for the 80387 and at 2693 kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation programmed in assembly language, the 387DX performance was 28% higher than the performance of the 80387. The transcendental functions have also sped up from the 80387 to the 387DX. In the Savage benchmark compiled with the Metaware High-C V1.6 optimizing compiler and running on a 29 MHz system, the 80387 evaluated 77600 function calls/second, while the 387DX evaluated 97800 function calls/second, a 26% increase [7]. Some instructions have been sped up a lot more than the average 20%. For example, the FBSTP instruction has been sped up by a factor of 3.64.

The Intel 387DX (and its predecessor 80387) are the only 387 coprocessors that support asynchronous operation of CPU and NDP. The 387 consists of a bus interface unit and a numerical execution unit. The bus interface unit always runs at the speed of the CPU clock (CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc, the numerical execution unit runs at the same speed as the bus interface unit. If CKM is tied to ground, the numerical execution unit runs at the speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10.
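The permitted coprocessor clock follows directly from these two ratio limits. The short Python sketch below only restates that arithmetic; the function names are made up for illustration and do not come from any Intel documentation.

```python
from fractions import Fraction

# Ratio limits for NUMCLK2 : CPUCLK2 as quoted in the text.
MIN_RATIO = Fraction(10, 16)
MAX_RATIO = Fraction(14, 10)


def numclk_range_mhz(cpu_mhz):
    """Return (lowest, highest) coprocessor clock in MHz for a given CPU clock.

    Hypothetical helper: it just multiplies the CPU clock by the two limits.
    """
    cpu = Fraction(cpu_mhz)
    return float(cpu * MIN_RATIO), float(cpu * MAX_RATIO)


def ratio_allowed(cpu_mhz, numclk_mhz):
    """Check whether a CPU / coprocessor clock pair stays inside the limits."""
    ratio = Fraction(numclk_mhz) / Fraction(cpu_mhz)
    return MIN_RATIO <= ratio <= MAX_RATIO


if __name__ == "__main__":
    low, high = numclk_range_mhz(20)
    print("20 MHz CPU: coprocessor clock from", low, "to", high, "MHz")
    print("20 MHz CPU with 33 MHz coprocessor allowed?", ratio_allowed(20, 33))
```

Exact fractions are used so that the boundary cases (a ratio of exactly 10:16 or 14:10) compare as allowed rather than being lost to floating-point rounding.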
For example, for a 20 MHz 386, the Intel 387DX could be clocked from 12.5 MHz to 28 MHz via the NUMCLK2 input. On the Cyrix 83D87, Cyrix 387+, ULSI 83C87, and the IIT 387, the CKM pin is not connected. These coprocessors always run at the speed of the CPU. The Intel 387DX is manufactured using Intel's advanced low power CHMOS IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW typical), at 25 MHz it is max. 1050 mW (625 mW typical), and at 33 MHz it is max. 1250 mW (750 mW typical) [59].

Intel 387SX is the coprocessor for the Intel 386SX. The 386SX is an Intel 386 with a 16-bit data path. This somewhat reduces the cost of building a complete system as compared to the full 32-bit design required by the 80386DX. The 386SX's main purpose was to replace the 80286 CPU, which Intel subsequently stopped producing. Due to the 16-bit data path, the 386SX is slower than the 386DX and offers about the same speed as an 80286 at the same clock frequency for 16-bit applications. As the 386SX is a complete 80386, it also offers the possibility to run 32-bit applications and supports the virtual 8086 mode used for example by Windows' enhanced mode. The 387SX has all the features the Intel 387DX offers, including the ability for asynchronous operation of CPU and coprocessor (see the above paragraph on the Intel 387DX for details). Due to the 16-bit data path between the CPU and the coprocessor, the 387SX is a bit slower than a 387DX operating at the same frequency. The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package and is available in 16 MHz and 20 MHz versions. Coprocessors for faster 386SX systems based on the Am386SX CPU are available from IIT, Cyrix, and ULSI. Power consumption for the 387SX at 16 MHz is max. 1250 mW (740 mW typical), for the 20 MHz version it is max. 1500 mW (1000 mW typical) [62].

IIT 3C87 came out in 1989 at about the same time as the Cyrix 83D87. Both coprocessors are faster than Intel's 387DX coprocessor.
Tests I ran with the IEEETEST program show that the 3C87 is not fully compatible with the IEEE-754 standard for floating-point arithmetic, although the manufacturer claims otherwise. It is quite possible that the reported errors are due to personal interpretations of the standard by the program's author that have been incorporated into IEEETEST and that the standard also supports the different interpretation chosen by IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST have become somewhat of an industry standard [66] and Intel's 387, 486, and RapidCAD chips pass the test without a single failure, so the fact that the IIT 3C87 fails some of the tests indicates that it is not fully compatible with the Intel 387 coprocessor. My tests also show that the IIT 3C87 does not support denormals for the double extended format. It is not entirely clear whether the IEEE standard mandates support for extended precision denormals, as the IEEE-754 document explicitly mentions only single and double precision denormals. Missing support for denormals is not a critical issue with most applications, but there are some programs for which support of denormals is quite helpful, if not important [41]. In any case, failure of the 3C87 to support extended precision denormal numbers is an incompatibility with the Intel 387 and 486.

The 3C87 provides extra functions not available on any other 387 chip [38]. It has 24 user accessible floating-point registers organized into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2) allow switching from one bank to another. Transfers between registers in different banks are not supported, however, so this feature by itself is of limited usefulness. Also, there seems to be only one status register (containing the stack top pointer), so it has to be manually loaded and stored when switching between banks with a different number of registers in use [40].
The register banks' main purpose is to aid the fourth additional instruction the 3C87 has (F4X4), which does a full multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D graphics applications [39]. I measured this instruction to execute in about 280 clock cycles, during which time it executes 16 multiplications and 12 additions. The built-in matrix multiply speeds the matrix by vector multiply up by a factor of 3 compared with a programmed solution according to IIT [39]. The results for my own TRNSFORM benchmark support this claim (see results below), showing a performance increase by a factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly as fast as on an Intel 486 at the same clock frequency. However, there are only very few applications that make use of this feature if an IIT 3C87 is detected at run time, among them Schroff Development's Silver Screen and Evolution Computing's Fast-CAD 3-D [25]. Like the 387 'clones' from Cyrix and ULSI, the 3C87 does not support asynchronous operation of the CPU and the coprocessor. The 3C87 always runs at the full speed of the CPU. The 3C87 is implemented in an advanced CMOS process and has low power requirements of typically about 600 mW. It is available in 16, 20, 25, 33, and 40 MHz versions.

IIT 3C87SX is the version of the IIT 3C87 that is intended for use with Intel's 386SX or AMD's Am386SX CPU. It is functionally equivalent to the IIT 3C87. Due to the 16-bit data path between the CPU and the coprocessor in a 386SX based system, coprocessor instructions will execute somewhat slower than on the 3C87. The IIT 3C87SX is the only 387SX coprocessor that is offered at speeds of 16, 20, 25, and 33 MHz right now. I have read that Cyrix has also announced an 83S87-33, but haven't seen it being offered yet. The 3C87SX is packaged in a 68-pin PLCC.

Cyrix 83D87 was introduced in 1989, only shortly after the coprocessors from IIT.
It has been the fastest 387 compatible coprocessor in several benchmark comparisons [1,7,68,69]. It also came out as the fastest coprocessor in my own tests (see benchmark results below). Although the Cyrix 83D87 provides up to 50% more performance than the Intel 387DX in benchmark comparisons, the speed advantage over other 387 compatible coprocessors in real applications is usually much smaller. For example, in a test using the program 3D-Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1].

Besides being the fastest 387 coprocessor, the 83D87 also offers the most accurate transcendental function results of all coprocessors tested (see test results below). Unlike the Intel coprocessors, which use the CORDIC [18,19] algorithm to compute the transcendental functions, Cyrix uses rational approximations to the functions. In the past the CORDIC method has been popular since it requires only shifts and adds, which makes it easy to implement. It is also reasonably fast. Recently, the cost of implementing fast floating-point multipliers has dropped significantly due to the availability of VLSI, making the use of rational approximations superior to CORDIC for the generation of transcendental functions [61]. The Cyrix 83D87 uses a very fast array multiplier, making its transcendental functions faster than those of any other 387 compatible coprocessor. It also uses 75 bits for the mantissa in intermediate calculations (as opposed to 68 bits on other coprocessors), making its transcendental functions more accurate than those of any other coprocessor or FPU (see results below).

The 83D87 and its successor, the 387+, are the 387 'clones' with the highest degree of compatibility. There are only very few SW and HW incompatibilities with the Intel 387DX. These have been documented by Cyrix [12]. The software differences are caused by some bugs present in the 387DX that Cyrix fixed for the 83D87.
Unlike the Intel 387DX, the 83D87 (and all other 387 'clones' as well) does not support asynchronous operation of CPU and coprocessor. There have also been problems in the past with the CPU-coprocessor communication, causing the 83D87 to hang on some machines. The reason was that Cyrix shaved off a wait state in the communication protocol, which caused a communications breakdown between the CPU and the 83D87 for some systems running at 25 MHz or faster. One notable example of this behavior was the Intel 302 board. The problem is only rarely encountered with the current generation of 386 motherboards. It is possible that the problem has been entirely eliminated in the 387+, the successor to the 83D87. To reduce power consumption, the 83D87 features advanced power saving features. Those portions of the coprocessor that are not needed are automatically shut down. If no coprocessor instructions are being executed, all parts except the bus interface unit are shut down [12]. Maximal power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW, typical power consumption at this clock frequency is 500 mW [15].

Cyrix EMC87 is basically a special version of the Cyrix 83D87. In addition to the normal 387 operating mode, in which coprocessor-CPU communication is handled through reserved IO-ports, it also offers a memory-mapped mode of operation similar to the operation principle of the Weitek Abacus. Please note that the EMC87 is *not* compatible with Weitek's Abacus coprocessor. They both use the same interface technique (memory mapping), but while the EMC87 uses the standard 387 instruction set, the Weitek coprocessors use a different instruction set of their own. Like the Weitek Abacus, the EMC87 occupies a 64 kByte memory block starting at physical address C0000000h. It can therefore only be accessed in the protected or virtual modes of the 386 CPU.
DOS programs can access the EMC87 with the help of DOS-extenders or memory managers like EMM386 which run in protected/virtual mode themselves. Since the EMC87 also provides the standard CPU interface via IO-ports, it can be used just like any other 387 compatible coprocessor and delivers the same performance as the Cyrix 83D87 in this mode. However, using the memory mapped mode of the EMC87 provides a significant speed advantage. The traditional 387 CPU-coprocessor interface via IO-ports has an overhead of about 16-20 clock cycles. Since the Cyrix 83D87 executes some operations like addition and multiplication in much less time, its performance is limited by the CPU-coprocessor interface. The memory-mapped mode has much less overhead and allows all coprocessor instructions to be executed at full speed and with no penalty. For this reason, Cyrix introduced the EMC87 in 1990. In a test, the EMC87 at 33 MHz ran the single precision Whetstone benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had a speed of only 5049 kWhetstones/sec, an increase of 50.6% [63]. In another test, the EMC87 ran a fractal computation at two times the speed of the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third test found the EMC87's overall performance to be 20% higher than the performance of the Cyrix 83D87 [65].

The Cyrix FasMath EMC87 has also been sold as Cyrix AutoMATH by Cyrix. The two chips are 100% identical. Unlike the Cyrix 83D87, which fits into the 68-pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that not all boards have such a socket, a notable exception being IBM's PS/2s, for example. Originally, Cyrix claimed support for the fast memory mapped mode of the EMC87 from a lot of software vendors (including Borland and Microsoft).
However, there are only very few applications that make use of it, among them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP FORTRAN-386 compiler and Intusoft's Spice [63]. I haven't seen the EMC87 being offered for about nine months now. It may be that Cyrix has discontinued this product due to lack of sufficient software support. The EMC87 was available in 25 and 33 MHz versions at the end of 1991.

Cyrix 387+ seems to be the successor to the Cyrix 83D87. On ordering a Cyrix coprocessor about a month ago, I was automatically supplied with a 387+. In my tests, I found the Cyrix 387+ to be about 5 to 10 percent *slower* than the Cyrix 83D87. However, some instructions like the square root (FSQRT) now only run at half the speed at which they ran in the 83D87 (see performance results below). I also found the transcendental functions on the 387+ to be a bit more accurate than those implemented in the 83D87. Why Cyrix has brought out a new coprocessor slower than the 83D87 I don't know. I have written to Cyrix about this question but haven't received a reply yet. Maybe the new coprocessor solves the one small hardware compatibility problem the 83D87 had (see above paragraph on the 83D87). It could also be that Cyrix had to design around the three Intel patents that Intel claims the 83D87 has violated. I have no idea whether the Cyrix 387+ is to replace the 83D87 or if both chips will coexist in the market. Like the 83D87, the 387+ is available for speeds of up to 40 MHz.

Cyrix 83S87 is the SX version of the Cyrix 83D87. Just as the Cyrix 83D87 is the fastest 387 compatible coprocessor, the Cyrix 83S87 is the fastest of the 387SX compatible coprocessors [1]. Besides being the fastest 387SX 'clone', the Cyrix 83S87 also features the most accurate transcendental functions. The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20 and 25 MHz versions.
Due to the advanced power saving features of the Cyrix coprocessor, the typical power consumption of the 20 MHz version is about 350 mW [67].

ULSI 83C87 is a 387 'clone' that came out in early 1991, well after the IIT 3C87 and Cyrix 83D87. Like all clones, it is somewhat faster than the Intel 387DX. The basic arithmetic functions are especially fast, while the transcendental functions show only a slight speed improvement over the Intel 387DX (see benchmark results below). In my tests, the ULSI had the least accurate transcendental functions. However, the maximum relative error is still within the limits set by Intel, so this is probably not an important issue in all but very few applications. The ULSI shows some minor flaws in the tests for IEEE-754 compatibility, but this, too, is unimportant under typical operating conditions. ULSI claims that the program IEEETEST, which was used to test for IEEE compatibility, contains many personal interpretations of the IEEE standard by the program's author and states that there is no ANSI-certified IEEE-754 compliance test. While this is most probably true, it is also a fact that the IEEE test vectors used in IEEETEST are sort of an industry standard and that Intel's 387, 486, and RapidCAD chips pass it without a single failure. Since the ULSI Math*Co 83C87 fails some of the tests, it is certainly less than 100% compatible with Intel's chips, although this will hardly make any difference in typical operating conditions.

The ULSI 83C87 is also not fully compatible with the Intel 387DX in that it does not implement the precision control feature of Intel's coprocessor [58]. While all the internal operations of 80x87 coprocessors are usually done with the maximum precision available (double extended precision with 64 mantissa bits), the 80x87 also offers the possibility to force lower precision to be used for the basic arithmetic functions add, subtract, multiply, divide, and square root.
This feature was included for compatibility with existing floating-point implementations at the time the 8087 was devised. All coprocessors except the ones from ULSI support this feature. Since precision control is rarely used, this incompatibility with the Intel 387DX does not pose major problems. IEEE-754 mentions precision control, but requires it only for those systems that don't have the possibility to store single and double precision results. Therefore, the standard does not call for precision control in the 387 coprocessor, so the ULSI 83C87's failure to provide precision control does not constitute a conflict with the IEEE-754 standard for floating point arithmetic.

Like the other 387 'clones', the 83C87 does not support asynchronous operation of the CPU and the coprocessor. This means that the 83C87 always runs at the full speed of the CPU. The ULSI 83C87 is available in 20, 25, 33, and 40 MHz versions. The ULSI is produced in high performance, low power CMOS. Power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625 mW typical), and at 40 MHz the ULSI Math*Co 83C87 consumes max. 1500 mW (750 mW typical) [58]. The 83C87 is packaged in a 68-pin ceramic PGA. ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc. will replace the coprocessor up to three times free of charge should it ever fail.

ULSI 83S87 is the SX version of the ULSI 83C87 for operation with an Intel 386SX or an AMD Am386SX. It is functionally equivalent to the 83C87. To aid low power laptop designs, the ULSI 83S87 features an advanced power saving design with a sleep mode and a standby mode with only minimal power requirements. Power consumption under normal operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25 MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.
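To make the precision control feature discussed above for the ULSI 83C87 more concrete, here is a minimal Python sketch of how the setting is encoded. It assumes the control word layout documented by Intel for the 80x87 family (precision control field in bits 8 and 9, with 037Fh as the control word after initialization on a 387); the function names are purely illustrative, and the sketch only manipulates the bit pattern, it does not talk to a real coprocessor.

```python
# Precision control (PC) field of the 80x87 control word: bits 8 and 9
# (layout assumed from Intel's documentation, not from this text).
PC_SHIFT = 8
PC_MASK = 0b11 << PC_SHIFT

# PC field value -> mantissa bits used for add, subtract, multiply,
# divide, and square root. The value 0b01 is reserved.
PC_MANTISSA_BITS = {0b00: 24, 0b10: 53, 0b11: 64}

# Control word after coprocessor initialization on a 387 (assumption).
DEFAULT_CONTROL_WORD = 0x037F


def get_precision_bits(control_word):
    """Return the mantissa width selected by a control word, or None if reserved."""
    field = (control_word & PC_MASK) >> PC_SHIFT
    return PC_MANTISSA_BITS.get(field)


def set_precision_bits(control_word, mantissa_bits):
    """Return a copy of control_word with the PC field set for the given width."""
    for field, bits in PC_MANTISSA_BITS.items():
        if bits == mantissa_bits:
            return (control_word & ~PC_MASK & 0xFFFF) | (field << PC_SHIFT)
    raise ValueError("mantissa_bits must be 24, 53, or 64")


if __name__ == "__main__":
    cw = DEFAULT_CONTROL_WORD
    print("default precision:", get_precision_bits(cw), "bits")
    cw = set_precision_bits(cw, 53)
    print("control word for double precision: %04Xh" % cw)
```

On real hardware a program would write such a value to the coprocessor with FLDCW; on a chip without precision control the field would simply have no effect on the results.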
Intel RapidCAD is not a coprocessor, strictly speaking, although it is marketed as one. Rather, it is a CPU replacement. It is basically an Intel 486DX without the cache and with a 386 pinout. RapidCAD is delivered as a set of two chips. RapidCAD-1 goes into the 386 socket and contains the CPU and FPU. RapidCAD-2 goes into the coprocessor socket and contains a PAL that generates the Ferr signal that is normally generated by a coprocessor and used by the motherboard circuitry to provide 287 compatible coprocessor exception handling in 386/387 systems. The RapidCAD instruction set is compatible with the 386, so it doesn't know the 486 specific instructions like BSWAP. Since the RapidCAD CPU core is very similar to the 486 CPU core, most of the register to register instructions execute in the same number of clock cycles as on the 486. The use of the 386 bus interface causes instructions that access memory to execute at about the same speed as on the 386.

The integer performance of the RapidCAD is definitely limited by the low memory bandwidth provided by the 386 bus interface (2 clock cycles per bus cycle) and the lack of an internal cache. CPU instructions often execute faster than they can be fetched from memory, even with a big and fast external cache. Therefore, the integer performance of the RapidCAD exceeds that of a 386 by at most 25%. This value was derived by running some programs that use mostly register-to-register operations and few memory accesses. This finding is supported by the SPEC ratings that Intel reports for the 386-33 and the RapidCAD-33. While the 386-33 has a SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase. Note that these tests used the old (1989) SPEC benchmark suite. While CPU instructions often execute in one clock cycle on the RapidCAD, FPU instructions always take more than seven clock cycles. They are therefore rarely slowed down by the low memory bandwidth provided by the 386 bus interface.
My tests show a 70%-100% performance increase for floating-point intensive benchmarks (see below) over a 386 based system using the Intel 387DX math coprocessor. This is consistent with the SPECfp rating reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85% increase. This means that a system that uses the RapidCAD is faster than any 386/387 combination, regardless of the type of 387 used (Intel 387DX or faster clone). The diagnostic disk for the RapidCAD also gives some application performance data for the RapidCAD compared to the Intel 387DX:

Application             Time w/ RapidCAD   Time w/ 387DX   Speedup

AUTOCAD 11               32 sec             52 sec          63%
AutoShade/Renderman     108 sec            180 sec          67%
Mathematica (Windows)   103 sec            139 sec          35%
SPSS/PC+ 4.01            14 sec             17 sec          21%

RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed through other channels than the other Intel math coprocessors. Therefore, I have been unable to obtain a data sheet for it. The RapidCAD-1 chip gets quite hot when operating and it can be assumed that its power consumption is similar to the 486-33. Therefore, I recommend extra cooling for this chip (see the paragraph below on the 486 for details). The RapidCAD-1 is packaged in a 132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a 68-pin PGA like an 80387 coprocessor.

Intel 486DX is not a coprocessor. This chip, brought out in 1989, functionally combines the CPU (a heavily pipelined implementation of the 386 architecture) with an enhanced 387 (the floating-point unit, FPU) and 8 kB of unified code/data cache on one chip. Of course, this description is simplified; for a detailed hardware description, see [52]. The 486DX offers about two to three times the integer performance of a 386 at the same frequency. Floating point performance is about three to four times as high as on the Intel 387DX at the same clock rate [29].
Since the FPU is on the same chip as the CPU, the considerable communication overhead between CPU and coprocessor in a 386/387 system is eliminated, letting FPU instructions run at the full speed permitted by the implementation. The FPU also takes advantage of the on-chip cache and the highly pipelined execution unit. Besides the higher speed, the 486 FPU features more accurate transcendental functions than the Intel 387DX coprocessor according to tests run by me (see below). To achieve better interrupt latency, FPU instructions with a long execution time have been made abortable in case an interrupt occurs during their execution. The concurrent execution of CPU and coprocessor instructions typical for 80x86/80x87 systems still exists on the 486, but some FPU instructions like FSIN have nearly no concurrency with CPU instructions, indicating that they make heavy use of both CPU and FPU resources [53, 1].

The 486DX comes in a 168 pin ceramic PGA (pin grid array). It is available in 25 MHz and 33 MHz versions. Since the end of 1991, there is also a 50 MHz version available done in a CHMOS V process (the 25 MHz and 33 MHz versions are produced using the CHMOS IV process). Maximum power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500 mW for the 33 MHz version (3500 mW typical), and 5000 mW (4000 mW typical) for the 50 MHz chip. Due to the considerable amount of heat produced by these chips, and taking into consideration the slow air flow provided by the fan in garden variety PC tower cases, I recommend an extra fan directly above the CPU for safer operation. If you measure the surface temperature of an i486 in a normal tower case without extra cooling after some time of operation, you may well come up with something like 80 - 90 degrees Celsius (that is 176 - 194 degrees Fahrenheit for those not familiar with metric units) [54,55]. You don't need the well known and expensive IceCap(tm) to effectively cool your CPU.
A simple fan mounted directly above the CPU can bring the temperature down to about 50 to 60 degrees Celsius (122 - 140 degrees Fahrenheit) depending on the room temperature and the temperature within the PC case (which depends on the total power dissipation of all the components and the cooling provided by the fan in the power unit). According to a simple rule known as Arrhenius' Law, lowering the temperature by 10 degrees Celsius slows down chemical reactions by a factor of two. Thus, lowering the temperature of your CPU by 30 degrees should prolong the life of the device by a factor of eight due to the slower aging process. If you are reluctant to add a fan to your system because of the additional noise, settle for a low-noise fan like those available from the German manufacturer Pabst (this is not meant to be an advertisement. I am just the happy owner of such a fan. Besides that, I have no connections to the firm).

Intel 486DX2 is the name for Intel's latest generation of 486 CPUs. Using the DX2 suffix instead of simply DX is meant to be an indicator that these are clock-doubled versions. A normal 486DX operates at the frequency provided by the incoming clock signal. A 486DX2 generates a new clock signal from the incoming clock by means of a PLL (phase locked loop). In the DX2, this clock signal has twice the frequency of the incoming clock, hence the name clock-doubler. All internal parts of the 486DX2 (cache, CPU core, FPU) run at this higher frequency. Only the bus interface runs at the normal speed. That way, a 486DX2-50 can run on a motherboard designed for 25 MHz operation. Since motherboards for 50 MHz operation are much harder to design than those for 25 MHz, this makes a 486DX2-50 system easier to build and cheaper than a 486DX-50 system. For all operations that don't access off-chip resources (e.g. register operations) a 486DX2-50 provides exactly the same performance as a 486DX-50 and twice the performance of a 486DX-25.
However, since the main memory in a 486DX2-50 system still operates at 25 MHz, all instructions involving memory accesses are potentially slower than in a 486DX-50 system, whose memory also runs at 50 MHz. The internal cache of the 486 mitigates this problem a bit, but overall performance of a 486DX2-50 is still lower than that of a 486DX-50, although Intel's documentation [32] shows this drop to be quite small. It depends a lot on the code one runs, though. The nice thing about the 486DX2 is that it allows easy upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely pin-compatible with the 486DX. Just take out the 486DX and plug in the new 486DX2. Note that power consumption of the 486DX2-50 equals that of the 486DX-50 (4000 mW typical), and that the 486DX2-66 exceeds this by about 30%. These chips get really hot in a standard PC case with no extra cooling. See the above paragraph for more detailed information on this problem.

Intel 487SX is the coprocessor intended for use in 486SX systems. The 486SX is basically a 486DX without the floating-point unit (FPU) [48, 50]. Originally Intel sold 486DXs with a defective FPU as 486SXs, but it has now completely removed the FPU part from the 486SX mask for mass production. The introduction of the 486SX in 1991 has been viewed mainly as a marketing 'trick' by Intel to take market share from the 386 based systems once AMD became successful with their Am386 (AMD has taken as much as 40% of the 386 market due to some superior features such as higher clock frequency, lower power consumption, and a fully static design). A 486SX at 20 MHz delivers a bit less integer performance than a 40 MHz Am386. To add floating-point capabilities to a 486SX based system, it would be easiest to swap the 486SX with a 486DX, which includes the FPU. However, Intel has prevented this easy solution by giving the 486SX a slightly different pin out [48, 51].
Since only three pins are assigned differently, clever board manufacturers have come out with boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU socket and provide a clean upgrade path this way. A set of three jumpers ensures correct signal assignment to the pins for either configuration. To upgrade systems without this feature, one has to buy the 487SX and put it into the "Performance Upgrade Socket" present in most 486SX systems. Once the 487SX was available, it was quickly found out that it is just a normal 486DX with a slightly different pin out [49]. Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX system, so the 486SX could be removed once the 487SX is installed. Since the shut down is logical, not electrical, the 486SX still uses power if used with the 487SX, although it is inoperative.

Technically speaking, the solution Intel chose was the only practical way to provide a 486SX system with the high level of floating-point performance the 486DX offers. The CPU and FPU have to be on the same chip, otherwise the FPU cannot make use of the cache on the CPU chip and there would be considerable overhead in CPU-FPU communication (similar to a 386/387 system), nullifying most of the arithmetic speedups over the 387. That the 486SX, 487SX, and 486DX are not pin-compatible seems to be purely for marketing reasons.

To upgrade a 486SX based system, Intel also offers the OverDrive chip, which is just the same as a 487SX with internal clock doubling. It also goes into the "Performance Upgrade Socket" found in 486SX systems. The OverDrive roughly doubles the performance of a 486SX/487SX based system. For an explanation of clock doubling, see the description of the 486DX2 above. Like the 486SX, the 487SX is available in 20 MHz and 25 MHz versions. At 20 MHz, the 487SX has a power consumption of max. 4000 mW. It is available in a 169 pin ceramic PGA (pin grid array).

Weitek 3167 was introduced in 1989 to provide the fastest floating point performance possible on a 386 based system at that time. The Weitek Abacus 3167 is not a real coprocessor, strictly speaking, but rather a memory mapped peripheral device. The Weitek 3167 was optimized for speed wherever possible. Besides using the faster memory mapped interface to the CPU (the 80x87 uses IO-ports), it does not support many of the features of the 80x87 coprocessors, allowing all of the chip's resources to be concentrated on the fast execution of the basic arithmetic operations. For a more detailed description of the Weitek 3167 see the first chapter of this document.

In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167 the Whetstone benchmark performed at 7574 kWhetstones/sec compared with 3743 kWhetstones/sec for the Intel 387DX. Note however that these are single precision results and that the Weitek 3167's performance would drop to about half the stated rate for double precision, while the value for the Intel 387DX would not change much. Anyhow, before the advent of the Intel RapidCAD, the Weitek 3167 usually beat all 387 compatible coprocessors even for double precision operations [63,65,69]. For typical applications the advantage of the Weitek 3167 over the 387 clones is much smaller. In a benchmark test using AutoDesk's 3D-Studio, the Weitek 3167 performed at 123% of the Intel 387DX's performance, compared with 106% for the Cyrix FasMath 83D87 and 118% for the Intel RapidCAD.

The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into an EMC socket provided by most 386 based systems. It does *not* fit into the normal coprocessor socket designed to hold a 387 compatible coprocessor in a 68-pin PGA. To get the best of both worlds, one might want to use a Weitek 3167 and a 387 compatible coprocessor in the same system.
These coprocessors can coexist in the same system just fine. The only problem is that most 386 based systems contain only one coprocessor socket, usually of the EMC (extended math coprocessor) type. Thus, you can install either a 387 coprocessor or a Weitek 3167, but not both. There are little daughter boards available, though, that fit into the EMC socket and provide two sockets, an EMC and a standard coprocessor socket. At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At 33 MHz, the max. power consumption is 2250 mW.

Weitek 4167 is a memory mapped coprocessor that has the same architecture as the 3167 and is designed to provide 486 based systems with the highest floating point performance available. It executes coprocessor instructions at three to four times the speed of the Weitek 3167. Although it is up to 80% faster than the Intel 486 in some benchmarks [1,69], the performance advantage for real applications is more like 10%. The introduction of the 486DX2 processors has more or less obliterated the need for a Weitek 4167, since the DX2 CPUs provide the same performance and all the additional features the 80x87 has over the Weitek Abacus. The Weitek 4167 is packaged in a 142-pin PGA package that is only slightly smaller than the 486's package. At 25 MHz, it has a max. power consumption of 2500 mW [32].

Chips & Technologies has shipped samples of their 38700 and 38700SX coprocessors, which are compatible with the Intel 387DX and Intel 387SX coprocessors, respectively. Both have already been tested in [1]. However, C&T's German distributor (Rein Elektronik, Nettetal) states that these coprocessors will not become generally available before 4Q 1992. The samples tested in [1] showed about the same performance as the Cyrix 83D87.

Pricing

Due to recent price cuts by Cyrix and subsequently by Intel for 387 coprocessors, prices have dropped significantly for all 287 and 387 compatible coprocessors, with hardly any price difference between manufacturers. 387DX compatible coprocessors typically sell for ~US$ 100 for all speeds except for 40 MHz versions, which are typically ~US$ 130. 387SX compatible coprocessors sell for ~US$ 90 regardless of speed, with the exception of the 33 MHz versions, which are ~US$ 100. The Intel 287XL sells for ~US$ 100, while the IIT 2C87 and Cyrix 82S87 sell for about US$ 70. 8087s may be more expensive, the price of an 8087-10 being US$ 150. I bought the Intel RapidCAD for US$ 320 and haven't seen it offered for a better price. I see the Weitek Abacus 3167-33 being offered for US$ 780 and the 4167-33 being offered for US$ 1100. This price information reflects the price situation as of 08-14-92. Prices can be expected to drop slightly in the near future.

If you have a demand for high floating-point performance, you should consider buying a 486 based system rather than a 386 based system with an additional coprocessor. A 386 motherboard for 33 MHz operation sells for ~US$ 300; together with the coprocessor, the total cost is ~US$ 400. A 486-33 ISA-board sells for US$ 650. While the 486-33 system is about 60% more expensive than the 386/387 system, it also provides 100% more integer and floating-point performance (twice the performance).

If you want to push your 386 based system to maximum floating-point performance and can't switch to a 486 based system for some reason, I recommend the Intel RapidCAD. It is both faster [1] and cheaper than installing a Weitek Abacus 3167 with your 386, which used to be the highest performing combination before the RapidCAD came out. Similarly, the introduction of the 486DX2 clock-doubler chips has obliterated the need for a Weitek 4167 to get maximum floating-point performance out of a 486 based system.
A 486DX2-66 performs at or above the performance level of a 33 MHz Weitek 4167, even if the latter uses single precision rather than double precision. The 486DX2-66 is rated by Intel at 24700 double precision kWhetstones/sec and 3.1 double precision Linpack MFLOPS. Of course, these benchmarks used the highest performance compilers available. But even with a Turbo Pascal 6.0 program, I managed to squeeze 1.6 double precision MFLOPS out of the 486DX2-66 for the LLL benchmark (for a description of the benchmarks mentioned, see the paragraph on benchmarks below). Although I haven't yet seen 486DX2-66 processors offered to end users for upgrade purposes, I recommend the 486DX2-66 to those who need the highest floating-point performance and are planning on buying a new PC. The price difference between a 33 MHz 486DX motherboard and a 486DX2-66 motherboard is around US$ 600, well below the price for the Weitek Abacus 4167.

Operation

In an 80x86/80x87 system, CPU instructions and coprocessor instructions are executed concurrently. This means that the CPU can execute CPU instructions while the coprocessor executes a coprocessor instruction at the same time. The concurrency is restricted somewhat by the fact that the CPU has to aid the coprocessor in certain operations. As the CPU and the coprocessor are fed from the same instruction stream and both instruction streams may operate on the same data, there has to be a synchronizing mechanism between the CPU and the coprocessor.

In an 8086/8087 or 8088/8087 system, both chips look at the opcodes coming in from the bus. To do this, both chips have the same BIU (bus interface unit) and the 8086 BIU sends the status signals of its prefetch queue to the 8087 BIU. This assures that both processors always decode the same instructions in parallel. Since all coprocessor instructions start with the bit pattern 11011, it is easy for the 8087 to ignore all other instructions.
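The opcode test described above can be sketched in a few lines. This is only an illustration in Python, not code from this document: the names ESC_PATTERN and is_coprocessor_opcode are made up for the example, and instruction prefixes (segment overrides, WAIT) are ignored for simplicity.

```python
# Sketch: recognizing 80x87 coprocessor opcodes by their top five bits.
# ESC_PATTERN and is_coprocessor_opcode are hypothetical names for this example.

ESC_PATTERN = 0b11011  # the bit pattern named in the text


def is_coprocessor_opcode(first_byte: int) -> bool:
    """Return True if the top five bits of the opcode byte are 11011."""
    return (first_byte >> 3) == ESC_PATTERN


# Collect every byte value that matches the pattern.
esc_bytes = [b for b in range(256) if is_coprocessor_opcode(b)]
print(" ".join(f"{b:02X}h" for b in esc_bytes))
```

The eight matching byte values are D8h through DFh; every other first opcode byte can be skipped by the coprocessor.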
The CPU, in turn, ignores all coprocessor instructions unless they access memory. In this case, the CPU computes the address of the LSB (least significant byte) of the memory operand and does a dummy read. The 8087 then takes the data from the data bus. If more than one memory access is needed to load a memory operand, the 8087 requests the bus from the CPU, generates the consecutive addresses of the operand's bytes and fetches them from the data bus. After completing the operation, the 8087 hands bus control back to the CPU. Since 8087 and CPU are hooked up to the same synchronous bus, they have to run at the same speed. This means that with the 8087, only synchronous operation of CPU and coprocessor is possible.

Another 8087 coprocessor instruction can only be started if the previous one has been completed in the NEU (numerical execution unit) of the 8087. To prevent the 8086 from decoding a new coprocessor instruction while the 8087 is still executing the previous coprocessor instruction, the following mechanism is used: The compilers and assemblers automatically generate a WAIT instruction before each coprocessor instruction. The WAIT instruction tests the /TEST pin until its input becomes "LOW". In 8086/8087 systems, the 8086 /TEST pin is connected to the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it forces its BUSY pin "HIGH". Thus the WAIT instruction in front of every coprocessor instruction stops the CPU until a still executing previous coprocessor instruction has finished. The same synchronization is used before the CPU accesses data that was written by the coprocessor. A WAIT instruction after the coprocessor instruction that writes to memory causes the CPU to stop until the coprocessor has transferred the data to memory, after which the CPU can safely access the data.

With the help of an additional chip, the 8087 can also be interfaced to the 80186 [36]. The 80186 was the CPU in some PCs (e.g.
from Philips, Siemens) in the 1982/1983 time frame, but with the introduction of the IBM AT, which used the 80286, it lost all significance for the PC market. The 80C186 (CMOS version of the 80186) nowadays sells as an embedded controller and can be combined with an 80C187 coprocessor, which is based on the internals of the Intel 387 [37].

The 80287 CPU-interface is totally different from the solution used in the 8087. Since the 80286 implements memory protection via an MMU based on segmentation, it would have been much too expensive to duplicate the whole protection logic on the coprocessor for an interface solution similar to the 8087. In an 80286/80287 system, the CPU fetches and stores all opcodes and operands for the coprocessor. Information is passed through ports F8h - FFh. As these ports are accessible under program control, care must be taken not to write to them accidentally, as this could corrupt the information in the math coprocessor.

The execution unit of the 80287 is practically identical to that of the 8087, that is, nearly all coprocessor instructions execute in the same number of clock cycles on both coprocessors. Due to the additional overhead of the CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz 80286/80287 combination can be slower than an 8086/8087 system running at the same speed for floating point intensive programs. Additionally, most of the older 286 boards were configured to run the coprocessor at 2/3 the speed of the CPU, making use of the ability of the 80287 to run asynchronously with the CPU. The 80287 has a CKM pin that causes the incoming system clock to be divided by three for the coprocessor if it is tied to ground. The 80286 always divides the system clock by two internally. Thus the ratio 2/3. However, when the CKM (ClocK Mode) pin is tied high on the 80287, it does not divide the CLK input. This feature has been exploited by the makers of coprocessor speed sockets.
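The clock relationships described above can be written down as a small calculation. The sketch below is an illustration only: the function names and the 16 MHz and 20 MHz example values are assumptions chosen for the example, not data from a specific board.

```python
from fractions import Fraction

# Hypothetical helper names; the divide ratios are the ones given in the text.


def cpu_clock(clk_mhz):
    """80286 core clock: the CLK input is always divided by two."""
    return Fraction(clk_mhz) / 2


def coprocessor_clock(clk_mhz, ckm_high=False):
    """80287 core clock: CLK/3 with CKM grounded, CLK undivided with CKM high."""
    return Fraction(clk_mhz) if ckm_high else Fraction(clk_mhz) / 3


clk = 16  # assumed system clock in MHz, which yields an 8 MHz 80286
ratio = coprocessor_clock(clk) / cpu_clock(clk)
print(f"CPU {float(cpu_clock(clk)):.2f} MHz, "
      f"80287 {float(coprocessor_clock(clk)):.2f} MHz, ratio {ratio}")

# With CKM tied high and a separate oscillator feeding CLK, the coprocessor
# clock no longer depends on the system clock at all.
socket_clock = coprocessor_clock(20, ckm_high=True)
print(f"80287 with CKM high and a 20 MHz oscillator: {float(socket_clock):.2f} MHz")
```

Exact fractions are used so that the 2/3 ratio comes out without rounding noise.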
Speed sockets tie CKM high and supply their own CLK signal with a built-in oscillator, thereby allowing the 80287 or compatible to run at a much higher speed than the CPU. With an IIT or Cyrix 287 one can have a 20 MHz coprocessor running with an 8 MHz 80286. Note however that the floating-point performance in such a configuration does not scale linearly with the coprocessor clock, since all the data has to be passed through the much slower CPU. If the coprocessor executes mostly simple instructions such as addition and multiplication, doubling the coprocessor clock in a 10 MHz system to 20 MHz does not show any performance increase at all [24].

The 80C287 by AMD is a 100% clone of the original Intel 80287, but is produced in CMOS, not in NMOS as the original Intel chip. This makes for lower power consumption. The 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of a 387 coprocessor, but are pin-compatible to the original 287. However, these chips divide the system clock by two internally, as opposed to three in the original Intel 80287. Since the 80286 also divides the system clock by two, they usually run synchronously with the CPU. They can also run asynchronously, though.

The 8086/8087 combination can be characterized as a cooperation of partners with equal rights, while the 80286/287 is more a master-slave relationship. This makes synchronization much easier, since the complete instruction and data flow of the coprocessor goes through the CPU. Before executing most coprocessor instructions, the 80286 tests its /BUSY pin, which is hooked up to the 287 coprocessor and signals if the 80287 is still executing a previous coprocessor instruction or has encountered an exception. The 80286 then waits until the 80287 is not busy before loading the coprocessor instruction into the coprocessor. Therefore, a WAIT instruction before every coprocessor instruction is not required. These WAITs are permissible, but not necessary in 80287 programs.
The second form of WAIT synchronisation, after the coprocessor has written a memory operand, is still necessary on 286/287 systems.

The coprocessor interface in 80386/80387 systems is very similar to the one found in 286/287 systems. However, to prevent corruption of the coprocessor's contents by programming errors, the IO-ports 800000F8h - 800000FFh are used, which are not user accessible. The interface has been optimized and uses 32-bit transfers. The overhead of the interface has been reduced to about 16-20 clock cycles. For some operations on the 387 'clones' that take less than 16 clock cycles to complete, this effectively limits the execution rate of coprocessor instructions. The only sensible solution to provide even higher floating point performance was to integrate the CPU and coprocessor functionality onto the same chip. This is what Intel did with the 80486. The FPU in the 486 also benefits from the instruction pipelining and from the integrated cache.

Performance

Several computer magazines have published performance comparisons at the application level for the 387 coprocessors and Weitek's Abacus 3167 and 4167 chips [1,25,68,70]. Applications tested included AutoCAD R11, RenderStar, Quattro Pro, Lotus 1-2-3, and AutoDesk's 3D-Studio. For most tests, performance improvements for the 387 clones over Intel's 387DX were small to marginal, the clones running the applications no more than 5% to 15% faster than the Intel 387DX. In the test of 3D-Studio, one of the few programs that supports the Weitek Abacus, the Weitek 3167 improved performance by 23% over an Intel 387DX and the 4167 improved performance by 10% over the 486 [1].

The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX coprocessor has a demonstration program that shows the speedup of certain application programs when run with the Intel coprocessor vs. a system with no coprocessor.

Application          Time w/o 387    Time w/ 387    Speedup

Art&Letters             87.0 sec       34.8 sec      150%
Quattro Pro              8.0 sec        4.0 sec      100%
Wingz                   17.9 sec        9.1 sec       97%
Mathematica            420.2 sec      337.0 sec       25%

The following table is an excerpt from [70]:

Application          Time w/o 387    Time w/ 387    Speedup

Corel Draw             471.0 sec      416.0 sec       13%
Freedom Of Press       163.0 sec       77.0 sec      112%
Lotus 1-2-3            257.0 sec       43.0 sec      498%

The following table is an excerpt from [25]:

Application          Time w/o 387    Time w/ 387    Speedup

Design CAD, Test 1      98.1 sec       50.0 sec       96%
Design CAD, Test 2      75.3 sec       35.0 sec      115%
Excel, Test 1            9.2 sec        6.8 sec       35%
Excel, Test 2           12.6 sec        9.3 sec       35%

The performance statistics below were put together with the help of four widely known numeric benchmarks and two benchmarks developed by me. Three Pascal programs, one FORTRAN program, and two assembly language programs were used. The assembly language programs were linked with Turbo Pascal 6.0 for library support, especially to include the coprocessor emulator of the TP 6.0 run-time library. The Pascal programs were compiled with Turbo Pascal 6.0 from Borland International, a non-optimizing compiler that produces 16-bit code. The FORTRAN program was compiled using MS FORTRAN 5.0, an optimizing compiler that generates 16-bit code. All programs except PEAKFLOP and SAVAGE, which use double extended precision, use double precision variables.

Note that with a highly optimizing compiler producing 32-bit code you will see much higher performance for some benchmarks. For example, Intel rates the 33 MHz 386/387DX at 3290 kWhetstones/sec and 0.4 double precision LINPACK MFLOPS [28,29]. The 33 MHz Intel 486 is rated by Intel at 12300 kWhetstones/sec and 1.6 double precision LINPACK MFLOPS [30]. The compilers used in these benchmarks run by the chip vendor are the ones that give the highest performance available. These compilers are in the US$ 1000+ price range. Some of them may be experimental or prereleased versions not available to the general public.
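The speedup columns in the application tables above follow from the two timing columns. As a minimal sketch, assuming the speedup is defined as (time without 387 / time with 387 - 1) * 100, which reproduces the figures listed for the Intel demonstration program, the values can be recomputed like this (the function name speedup_percent is made up for the example):

```python
# Sketch: recomputing the "Speedup" column from the two timing columns.
# speedup_percent is a hypothetical helper, not part of any benchmark program.


def speedup_percent(time_without, time_with):
    """Speedup in percent: how much faster the run with a coprocessor is."""
    return (time_without / time_with - 1.0) * 100.0


# Timings in seconds, copied from the Intel demonstration program table.
timings = {
    "Art&Letters": (87.0, 34.8),
    "Quattro Pro": (8.0, 4.0),
    "Wingz": (17.9, 9.1),
    "Mathematica": (420.2, 337.0),
}

for name, (t_without, t_with) in timings.items():
    print(f"{name:12s} {speedup_percent(t_without, t_with):6.1f}%")
```

With this definition a speedup of 100% means the program runs twice as fast, which is also how the 386/387 versus 486 comparison in the pricing section is worded.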

The relative performance of one coprocessor to another could vary depending on the code generated by compilers. Non-optimizing compilers tend to generate a high percentage of operations which access variables in memory, while optimizing compilers produce code that contains many operations involving registers. Thus it is quite possible that coprocessor A beats coprocessor B running benchmark Z if compiled with compiler C, but B beats A when the same benchmark is compiled using compiler D.

All benchmarks in this overview were run from floppy under a 'bare-bones' MS-DOS 5.0 without the CONFIG.SYS and AUTOEXEC.BAT files. This way, it was made sure no TSR or other program unnecessarily stole computing resources from the benchmarks.

Coprocessor performance also depends on the motherboard, or more specifically the chip set used on the motherboard. In [34] and [35] identically configured motherboards using different 386 chip sets were tested. Among other tests, a coprocessor benchmark based on a fractal computation was run and its execution time recorded. The following tables, which show that coprocessor performance varies with the chip set, have been copied from these articles in abridged form.

                  Cyrix                                    Cyrix
chip set          387+                 chip set            83D87

Opti, 40 MHz      24.57 sec   97.0%    PC-Chips, 33 MHz    26.97 sec   93.0%
Elite, 40 MHz     24.46 sec   97.4%    UMC, 33 MHz         27.69 sec   90.5%
ACT, 40 MHz       23.84 sec  100.0%    Headland, 33 MHz    25.08 sec  100.0%
Forex, 40 MHz     23.84 sec  100.0%    Eteq, 33 MHz        27.38 sec   91.6%

This shows that performance of the same coprocessor can vary by up to ~10% depending on the chip set used on your board, at least for 386 motherboards (similar numbers for 286, 386sx, and 486 are unfortunately not available). The benchmarks for this article were run on a board with the Forex chip set, which is one of the fastest 386 chip sets there is, not only with respect to floating-point performance [35].

Description of benchmarks

PEAKFLOP is the kernel of a fractal computation.
It consists mainly of a tight loop written in assembly code and fine tuned to give maximum performance. All variables are held in the CPU's and coprocessor's registers, so the only memory access is for opcode fetches. The main loop contains three multiplications and five additions/subtractions. This ratio is fairly typical for other floating point intensive programs as well. The whole program fits nicely into even a very small CPU cache. Due to the nature of this program, its MFLOPS rate is unlikely to be exceeded by any program that calculates anything useful. Thus the name PEAKFLOP. You will find the source code for PEAKFLOP in appendix B.

TRNSFORM multiplies an array of 8191 vectors with a 3D-transformation matrix (a 4x4 matrix). Each vector consists of four double precision values. Multiplying vectors with a matrix is a typical operation in the manipulation (e.g. rotation) of 3D objects which are made up from many vectors describing the object. This benchmark stresses addition and multiplication as well as memory access. For each vector, 16 multiplications and 12 additions are used. About 256 kByte of data is accessed during the benchmark. TRNSFORM is implemented as an optimized assembler program linked with the Turbo Pascal 6.0 library. For the IIT 3C87, a special version was written that makes use of the special F4X4 instruction available on that coprocessor. F4X4 does a full multiplication of a 4x4 matrix by a 4x1 vector in a single instruction. The full source code for the TRNSFORM program is in appendix B.

LLL is short for Lawrence Livermore Loops [21], a set of kernels taken from real floating point intensive programs. Some of these loops are vectorizable, but since we don't deal with vector processors here, this doesn't matter. For this test, LLL was adapted from the FORTRAN original [20] to Turbo Pascal 6.0.
By variable overlaying (similar to FORTRAN's EQUIVALENCE statement), memory allocation for data was reduced to 64 kB, so all data fits into a single 64 kB segment. The older version of LLL is used here, which contains 14 loops. There also exists a newer, more elaborate version consisting of 24 kernels. The kernels in LLL exercise only multiplication and addition. The MFLOPS rate reported is the average of the MFLOPS rates of all 14 kernels as reported by the LLL program.

LLL and Whetstone results (see below) are reported as returned by my COMPTEST test program, in which they have been included as a measure of coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal 6.0 with all 'optimizations' on and using my own run-time library, which gives higher performance than the one included with TP 6.0. My library is available as TPL60N15.ZIP from garbo.uwasa.fi and ftp-sites that mirror this site.

Linpack [5] is a well known floating-point benchmark that also heavily exercises the memory system. Linpack operates on large matrices and takes up about 570 kB in the version used for this test. This is about the largest program size a pure DOS system can accommodate. Linpack was originally designed to estimate performance of BLAS, a library of FORTRAN subroutines that handles various vector and matrix operations. It uses two routines from BLAS which are thought to be typical of the matrix operations used by BLAS. Both routines only use addition/subtraction and multiplication. The FORTRAN source code for Linpack can be obtained from the automated mail server netlib@ornl.gov. Linpack was compiled using MS FORTRAN 5.0 in the HUGE memory model (which can handle data structures larger than 64 kB) and with compiler switches set for maximum optimization. Linpack repeatedly does the same test. The number reported is the maximum MFLOPS rate returned by Linpack. Linpack MFLOPS ratings for a great number of machines are contained in [6].
This PostScript document is also available from netlib@ornl.gov.

Whetstone [2,3,4] is a synthetic benchmark based upon statistics collected about the use of certain control and data structures in programs written in high level languages. Based on these statistics, Whetstone tries to mirror a 'typical' HLL program. Whetstone performance is expressed by how many theoretical 'whetstone' instructions are executed per second. It was originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and Linpack, Whetstone not only uses addition and multiplication but exercises all basic arithmetic operations as well as some transcendental functions. Whetstone performance depends on the speed of the coprocessor as well as on the speed of the CPU, while PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU. There exist an old and a new version of Whetstone. Note that results from the two versions can differ by as much as 20% for the same test configuration. For this test, the new version in Pascal from [3] was used. It was compiled with Turbo Pascal 6.0 and my own library (see above) with all 'optimizations' on.

SAVAGE tests the performance of transcendental function evaluation. It is basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt functions are combined in a single expression. While sin, cos, arctan, and sqrt can be evaluated directly with a single 387 coprocessor instruction each, ln and exp need additional preprocessing for argument reduction and result conversion. According to [14], the Savage benchmark was devised by Bill Savage, and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore Front Parkway, Rockaway Beach, NY 11693, USA. Usually, Savage is programmed to make 250,000 passes through the loop. Here only 10,000 loops are executed for a total of 60,000 transcendental function evaluations. The result is expressed in function evaluations per second.
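The structure of the SAVAGE loop can be illustrated with a short sketch. It is written in Python, not in the Turbo Pascal used for the measurements, and it assumes the commonly quoted form of the Savage expression, a := tan(arctan(exp(ln(sqrt(a*a))))) + 1, with the tangent written as sin/cos so that six functions are evaluated per pass. The names savage and EVALS_PER_PASS are made up for the example, and timings from this sketch are not comparable with the figures in this document.

```python
import math
import time

EVALS_PER_PASS = 6  # sin, cos, arctan, ln, exp, sqrt


def savage(passes=10_000):
    """Run the assumed Savage loop and return the final value of a."""
    a = 1.0
    for _ in range(passes):
        t = math.atan(math.exp(math.log(math.sqrt(a * a))))
        a = math.sin(t) / math.cos(t) + 1.0  # tan(t) expressed as sin/cos
    return a


passes = 10_000
start = time.perf_counter()
result = savage(passes)
elapsed = time.perf_counter() - start

evaluations = EVALS_PER_PASS * passes
print(f"final value {result:.6f} (exact arithmetic would give {passes + 1})")
print(f"{evaluations} evaluations, {evaluations / elapsed:.0f} per second")
```

Since each pass ideally maps a to a + 1, the deviation of the final value from passes + 1 gives a rough impression of the accumulated error of the transcendental functions.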
SAVAGE source code was taken from [7] and compiled with Turbo Pascal 6.0 and my own run-time library (see above).

Benchmark results for 387 coprocessors, coprocessor emulators and the Intel RapidCAD and Intel 486 CPUs.

40 MHz            PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                    MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec

386, EM87           0.0084    0.0080    0.0060    0.0060        31       502  ##
386, Franke387      0.0369    0.0295    0.0233    0.0215       164      4002  $$
386, TP 6 Emu       0.0316    0.0273    0.0200    0.0190       160      3794  %%
Intel 387DX         0.9204    0.7212    0.3932    0.3211      2428     52677
ULSI 83C87          1.2093    0.7936    0.3890    0.3120      2528     56926
IIT 3C87            1.0196    0.7145    0.3834    0.3179      2663     58766
IIT 3C87,4X4        1.0196    1.7244    0.3834    0.3179      2663     58766  ??
Cyrix 387+          1.1305    0.8162    0.3945    0.3208      2946     80322
Intel RapidCAD      2.2128    1.8931    0.7377    0.5432      4810     86957
Intel 486           2.4762    2.1335    1.1110    0.8204      6195     98522

33.3 MHz          PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                    MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec

386, EM87           0.0070    0.0040    0.0050    0.0050        26       418  ##
386, Franke387      0.0307    0.0246    0.0194    0.0179       137      3335  $$
386, TP 6 Emu       0.0263    0.0227    0.0167    0.0158       133      3160  %%
Intel 387DX         0.7647    0.6004    0.3283    0.2676      2046     43860
ULSI 83C87          1.0097    0.6609    0.3239    0.2598      2089     47431
IIT 3C87            0.8455    0.5957    0.3198    0.2646      2203     49020
IIT 3C87,4X4        0.8455    1.4334    0.3198    0.2646      2203     49020  ??
Cyrix 387+          0.9286    0.6806    0.3293    0.2669      2435     66890
Cyrix 83D87          1.013       N/A     0.333     0.273      2550       N/A
Intel RapidCAD      1.8572    1.5798    0.6072    0.4533      3953     72464
Intel 486           2.0800    1.7779    0.9387    0.6682      5143     82192

For comparison:

                  PEAKFLOP  TRNSFORM       LLL   Linpack Whetstone    Savage
                    MFLOPS    MFLOPS    MFLOPS    MFLOPS kWhet/sec  Func/sec

i486DX2-66          4.1601    3.4227    1.6531    1.3010     10655    163934
i486DX2-50          3.0589    2.6665    1.2537    0.9744      7962    123203
i387, 20 MHz        0.2253    0.3271    0.1434    0.1171       952     21739  ++
i387DX, 20 MHz      0.3567    0.4444    0.1484    0.1161      1034     24155  &&
i80287, 5 MHz       0.0281    0.0310    0.0242    0.0222       150      3261  !!
i8087, 9.54 MHz     0.0636    0.0705    0.0321    0.0219       234      5782  **

HW configuration for test of 387 coprocessors and Intel RapidCAD:
System A: Motherboard with Forex chip set, 128 kB CPU Cache, 8 MB RAM

HW configuration for test of 486 FPU (extra fan for 40 MHz operation):
System B: Motherboard with SIS chip set, 256 kB CPU Cache, 8 MB RAM

## EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that loads as a TSR. It uses the INT 7 traps that 80286 and 80386 systems without a coprocessor generate upon encountering coprocessor instructions to catch and emulate these instructions. Whetstone and Savage benchmarks for this test were compiled with the original TP 6.0 library, as EM87 chokes on the 387 specific FSIN and FCOS instructions used in my own library if a 387 is detected. Obviously EM87 identifies itself as a 387, but has no support for 387 specific instructions.

$$ Franke387 is a commercial 387 emulator that is also available in a shareware version. For this test, shareware version V2.4 was used. Franke387, unlike many other emulators, supports all 387 instructions. It is loaded as a device driver and uses INT 7 to trap coprocessor instructions.

%% These benchmarks were run using the built-in coprocessor emulators of the TP 6.0 and the MS FORTRAN 5.0 run-time libraries.

?? The 3C87 specific F4X4 instruction was used in the vector transformation benchmark.

++ Older motherboard with no chip set (discrete logic), no CPU cache, 16 MB RAM

&& System A, CPU cache disabled via extended set-up, turbo-switch set to half speed (that is, 20 MHz)

!! 80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM. Due to the fast CPU used here, performance figures are somewhat higher than can be expected for an 80286/287 combination, except for the PEAKFLOP benchmark, which is basically coprocessor limited.

** 8086/8087 system with 640 kB RAM

Since neither a Weitek coprocessor nor a compiler that generates code for the Weitek chips was available, performance data for the Weitek Abacus are given here according to [31,32] and scaled to show performance of a 33 MHz system. The benchmarks were compiled using highly optimizing 32-bit compilers.

                      Single Prec.      Double Prec.      Double Prec.
                      3167    4167      3167    4167       387     486

Linpack MFLOPS         1.8     5.0       0.8     3.2       0.4     1.6
Whetstone kWhet/sec   7470   22700      4900   14000      3290   12300

Note that for the Intel coprocessors, running programs in single vs. double precision doesn't provide much of a performance advantage, since all internal calculations are always done in extended precision. Using Weitek coprocessors, however, performance nearly doubles when switching from double to single precision. For double precision calculations using only basic arithmetic, the Weitek Abacus can provide at most twice the performance of the respective Intel coprocessor (387/486) clocked at the same speed.

Speed of various coprocessor instructions in clock cycles, as measured with my program 87TIMES. Error is +/- one clock cycle, except for the Intel 80287. Times for the 80287 were determined on a system with a 20 MHz 80386 and a 5 MHz Intel 80287. Therefore, times may differ from a genuine 80286/287 system, especially for those instructions that access an operand in memory. Since the times are stated as the number of coprocessor clock cycles used, the faster 386, which can execute four clock cycles where the 80287 executes one clock cycle, may decrease memory access times as seen by the coprocessor.

                |    Intel    Intel    Cyrix    Cyrix     ULSI      IIT    Intel    Intel
                |     i486 RapidCAD     387+    83D87    83C87     3C87    387DX    80387

FLD1            |        5        7       17       17       17       22       27       35
FLDZ            |        5        7       17       17       17       22       22       29
FLDPI           |        8        9       17       17       17       22       37       45
FLDLG2          |        8        9       17       17       17       22       37       44
FLDL2T          |        8        9       17       17       17       22       37       44
FLDL2E          |        8        9       17       17       17       22       37       44
FLDLN2          |        8        9       17       17       17       22       37       45
FLD ST(0)       |        5        7       17       17       17       22       17       24
FST ST(1)       |        4        7       17       17       17       17       17       24
FSTP ST(0)      |        5        7       17       17       17       18       23       25
FSTP ST(1)      |        5        7       17       17       17       17       23       25
FLD ST(1)       |        5        7       17       17       17       22       17       25
FXCH ST(1)      |        5        7       17       17       17       22       22       25
FILD [Word]     |       13       16       35       36       41       46       46       65
FILD [DWord]    |       12       17       30       30       37       37       40       51
FILD [QWord]    |       13       20       40       40       47       47       45       66
FLD [DWord]     |        7       13       30       36       32       37       25       35
FLD [QWord]     |        7       15       40       44       42       47       35       45
FLD [TByte]     |       10       19       52       52       52       57       57       61
FBLD [TByte]    |       83       91       84       66      145      205       70      278
FIST [Word]     |       32       34       43       42       45       54       72       92
FIST [DWord]    |       33       35       48       44       48       57       74       91
FST [DWord]     |       11       14       44       42       49       41       46       47
FST [QWord]     |       16       18       56       54       60       53       58       60
FISTP [Word]    |       32       35       43       42       45       49       73       93
FISTP [DWord]   |       34       37       48       44       48       52       75       88
FISTP [QWord]   |       35       37       57       53       61       63       86       96
FSTP [DWord]    |       12       13       44       42       48       37       46       42
FSTP [QWord]    |       16       17       56       55       60       50       59       57
FSTP [TByte]    |       14       16       59       58       58       56       67       70
FBSTP [TByte]   |      171      175      101       98      126      216      147      535
FINIT           |       18       35       18       18       18       18       19       25
FCLEX           |        8       24       18       18       18       18       19       25
FCHS            |        8       11       17       17       17       17       31       35
FABS            |        6        8       17       17       17       17       28       31
FXAM            |       13       15       17       17       17       17       37       40
FTST            |        5        7       22       17       22       22       32       35
FSTENV          |       68       85      127      127      135      127      162      169
FLDENV          |       45       62      109      109      123      109      122      132
FSAVE           |      160      172      359      359      366      377      467      504
FRSTOR          |      131      206      361      361      369      367      424      453
FSTSW [mem]     |        4        7       16       16       17       16       17       22
FSTSW AX        |        4        7       14       14       14       14       14       17
FSTCW [mem]     |        4        7       16       16       16       16       16       22
FLDCW [mem]     |        5       14       28       28       29       29       29       34
FADD ST,ST(0)   |        8        9       22       17       17       22       27       30
FADD ST,ST(1)   |        9       10       22       17       17       22       22       34
FADD ST(1),ST   |       10       10       22       17       17       22       23       35
FADDP ST(1),ST  |       11       11       22       17       17       22       23       34
FADD [DWord]    |        9       14       30       30       33       32       31       42
FADD [QWord]    |        9       16       40       40       43       42       41       51
FIADD [Word]    |       20       21       36       36       43       43       49       77
FIADD [DWord]   |       20       25       30       30       38       38       43       65
FSUB ST(1),ST   |       10       10       22       17       17       22       23       35
FSUBR ST(1),ST  |        9       10       22       17       20       25       27       35
FSUBRP ST(1),ST |       10       10       22       17       17       22       23       35
FSUB [DWord]    |       11       14       30       30       32       32       30       41
FSUB [QWord]    |       11       16       40       40       42       43       40       51
FISUB [Word]    |       21       21       36       36       44       43       56       77
FISUB [DWord]   |       21       25       30       30       39       38       43       65
FMUL ST,ST(1)   |       16       17       22       22       22       27       38       56
FMUL ST(1),ST   |       16       17       22       22       22       27       40       60
FMULP ST(1),ST  |       16       17       22       22       22       27       38       59
FIMUL [Word]    |       22       23       36       36       50       43       50       77
FIMUL [DWord]   |       22       25       36       36       45       38       46       73
FMUL [DWord]    |       11       14       36       36       32       38       31       48
FMUL [QWord]    |       14       16       46       46       42       48       41       72
FDIV ST,ST(0)   |       73       74       38       23       52       57       92       95
FDIV ST,ST(1)   |       73       74       42       36       52       57       78       95
FDIV ST(1),ST   |       73       74       42       36       52       57       78       99
FDIVR ST(1),ST  |       73       74       42       36       53       57       77      100
FDIVRP ST(1),ST |       73       74       42       36       52       57       78      101
FIDIV [Word]    |       84       85       61       54       79       73      105      144
FIDIV [DWord]   |       84       85       54       47       74       68      101      129
FDIV [DWord]    |       73       74       54       48       63       62       78      100
FDIV [QWord]    |       73       74       64       57       72       72       79      113
FSQRT (0.0)     |       26       28       17       17       17       22       27       35
FSQRT (1.0)     |       83       84       72       36       87       57      112      128
FSQRT (L2T)     |       86       87       72       36       87       57      102      133
FXTRACT (L2T)   |       17       17       22       17       32       76       56       68
FSCALE (PI,5)   |       30       31       22       36       47       77       57       80
FRNDINT (PI)    |       31       31       27       19       32       27       47       74
FPREM (99,PI)   |       58       60      102       52       57       52       77      100
FPREM1(99,PI)   |       90       91      102       57       62       52      102      119
FCOM            |        5        7       17       17       27       17       27       34
FCOMP           |        6        7       17       17       27       17       28       35
FCOMPP          |        7        8       17       17       27       22       28       34
FICOM [Word]    |       16       20       36       36       49       37       61       77
FICOM [DWord]   |       18       25       30       30       44       32       48       61
FCOM [DWord]    |        7       14       30       30       33       32       31       35
FCOM [QWord]    |        7       15       40       40       43       42       41       51
FSIN (0.0)      |       25       27       97       17       17       22       37       45
FSIN (1.0)      |      310      314      162      116      492      222      512      593
FSIN (PI)       |       88       90      187      121       67      217      132      155
FSIN (LG2)      |      284      288       84       73      445      184      434      505
FSIN (L2T)      |      299      303      177      121      472      217      452      533
FCOS (0.0)      |       25       27      157       17       22       22       37       44
FCOS (1.0)      |      302      306      107       87      487      212      457      540
FCOS (PI)       |       89       92      257      151       62      222      197      230
FCOS (LG2)      |      300      304      152      106      452      192      502      584
FCOS (L2T)      |      307      311      242      156      467      222      507      598
FSINCOS (0.0)   |       26       29       17       17       22       31       41       54
FSINCOS (1.0)   |      353      357      172
126 492 416 536 637 FSINCOS (PI) | 105 107 262 161 67 421 226 273 FSINCOS (LG2) | 340 344 157 116 457 361 531 628 FSINCOS (L2T) | 347 351 247 166 472 421 536 643 FPTAN (0.0) | 26 28 17 17 22 31 36 43 FPTAN (1.0) | 267 269 147 121 537 306 322 392 FPTAN (PI) | 145 146 227 136 112 306 167 212 FPTAN (LG2) | 244 246 132 91 502 276 297 363 FPTAN (L2T) | 247 249 217 136 517 306 297 363 FPATAN (0.0) | 39 41 27 22 22 27 97 92 FPATAN (1.0) | 294 298 157 121 372 602 358 433 FPATAN (PI) | 304 307 192 143 357 422 378 468 FPATAN (LG2) | 289 293 157 126 362 382 373 447 FPATAN (L2T) | 304 307 192 141 362 422 373 463 F2XM1 (0.0) | 26 28 17 17 17 22 37 38 F2XM1 (LN2) | 209 212 122 86 392 287 297 348 F2XM1 (LG2) | 204 207 107 76 377 287 292 340 FYL2X (1.0) | 60 60 42 36 72 92 112 127 FYL2X (PI) | 294 297 162 111 452 357 393 497 FYL2X (LG2) | 311 314 162 106 457 337 408 512 FYL2X (L2T) | 293 296 162 111 437 357 393 496 FYL2XP1 (LG2) | 334 337 167 101 462 282 433 533 80386 + 80386 + 80386 + Intel Intel Franke387 TP 6.0 EM87 8087 80287 Emulator Emulator Emulator FSTP ST(0) | 26 54 507 358 2115 FLD1 | 26 55 481 422 1626 FLDZ | 21 53 480 416 1646 FLDPI | 26 55 486 443 1626 FLDLG2 | 26 56 486 423 1626 FLDL2T | 26 55 486 440 1626 FLDL2E | 26 53 486 423 1626 FLDLN2 | 26 55 486 441 1626 FLD ST(0) | 31 55 493 362 1851 FST ST(1) | 26 54 489 355 1931 FSTP ST(1) | 21 55 507 356 2116 FLD ST(1) | 26 55 493 362 1852 FXCH ST(1) | 21 57 497 486 2187 FILD [Word] | 58 90 667 712 2259 FILD [DWord] | 64 74 608 812 2164 FILD [QWord] | 74 93 652 707 2971 FLD [DWord] | 49 44 633 473 2077 FLD [QWord] | 54 57 641 524 2336 FLD [TByte] | 59 45 607 492 2063 FBLD [TByte] | 309 310 2019 1512 17827 FIST [Word] | 79 72 854 766 2418 FIST [DWord] | 84 80 865 518 2325 FST [DWord] | 89 85 686 441 2200 FST [QWord] | 99 92 703 516 2481 FISTP [Word] | 79 80 864 794 2620 FISTP [DWord] | 79 81 879 541 2523 FISTP [QWord] | 88 75 904 916 3226 FSTP [DWord] | 89 75 713 467 2400 FSTP [QWord] | 93 72 732 538 2678 FSTP [TByte] | 49 
21 685 467 2124 FBSTP [TByte] | 528 472 3305 1555 27013 FINIT | 11 10 742 641 1369 FCLEX | 11 10 440 323 912 FCHS | 21 54 460 354 1744 FABS | 21 54 456 349 1738 FXAM | 21 54 481 380 1551 FTST | 51 75 585 386 2721 FSTENV | 54 57 928 519 2104 FLDENV | 48 50 1125 450 1631 FSAVE | 214 244 1949 976 2749 FRSTOR | 209 227 2182 657 2225 FSTSW [mem] | 28 10 516 401 1189 FSTSW AX | N/A 55 451 N/A N/A FSTCW [mem] | 28 10 506 359 1167 FLDCW [mem] | 19 47 524 437 1584 FADD ST,ST(0) | 86 128 643 706 2805 FADD ST,ST(1) | 85 116 707 808 3093 FADD ST(1),ST | 92 131 664 812 3146 FADDP ST(1),ST | 92 129 704 799 3143 FADD [DWord] | 105 122 874 969 3139 FADD [QWord] | 115 122 888 1021 3396 FIADD [Word] | 115 122 940 1211 3330 FIADD [DWord] | 125 122 882 1297 3215 FSUB ST(1),ST | 88 130 738 817 3156 FSUBR ST(1),ST | 96 132 740 868 3004 FSUBRP ST(1),ST | 99 132 733 805 3301 FSUB [DWord] | 119 122 918 1018 3127 FSUB [QWord] | 129 123 932 1070 3632 FISUB [Word] | 115 123 977 1081 3802 FISUB [DWord] | 125 125 940 980 4161 FMUL ST,ST(1) | 145 151 810 1368 3924 FMUL ST(1),ST | 145 151 817 1377 3962 FMULP ST(1),ST | 148 168 840 1365 4164 FIMUL [Word] | 132 151 1039 1517 4039 FIMUL [DWord] | 141 151 980 1643 3976 FMUL [DWord] | 125 123 948 1480 3445 FMUL [QWord] | 175 192 991 1602 4416 FDIV ST,ST(0) | 201 207 726 1536 9789 FDIV ST,ST(1) | 203 218 808 1658 10332 FDIV ST(1),ST | 207 214 825 1655 10342 FDIVR ST(1),ST | 201 206 819 1806 10213 FDIVRP ST(1),ST | 201 205 845 1803 10409 FIDIV [Word] | 237 227 980 1779 11225 FIDIV [DWord] | 246 227 944 1680 11572 FDIV [DWord] | 229 226 893 1722 10577 FDIV [QWord] | 236 227 993 1777 10829 FSQRT (0.0) | 21 57 512 382 1755 FSQRT (1.0) | 186 206 1106 2504 37836 FSQRT (L2T) | 186 207 1398 2467 37925 FXTRACT (L2T) | 51 56 726 571 3326 FSCALE (PI,5) | 41 56 817 443 3194 FRNDINT (PI) | 51 58 808 800 7092 FPREM (99,PI) | 81 131 1696 941 4098 FPREM1(99,PI) | N/A N/A 1625 N/A N/A FCOM | 56 75 582 483 2799 FCOMP | 61 92 616 485 2983 FCOMPP | 61 90 661 476 3198 
FICOM [Word] | 79 77 808 861 3654 FICOM [DWord] | 89 77 750 964 3684 FCOM [DWord] | 74 75 741 625 3643 FCOM [QWord] | 74 76 754 667 3771 FSIN (0.0) | N/A N/A 639 N/A N/A FSIN (1.0) | N/A N/A 4640 N/A N/A FSIN (PI) | N/A N/A 2488 N/A N/A FSIN (LG2) | N/A N/A 3911 N/A N/A FSIN (L2T) | N/A N/A 3767 N/A N/A FCOS (0.0) | N/A N/A 740 N/A N/A FCOS (1.0) | N/A N/A 4777 N/A N/A FCOS (PI) | N/A N/A 2557 N/A N/A FCOS (LG2) | N/A N/A 4176 N/A N/A FCOS (L2T) | N/A N/A 3905 N/A N/A FSINCOS (0.0) | N/A N/A 714 N/A N/A FSINCOS (1.0) | N/A N/A 6049 N/A N/A FSINCOS (PI) | N/A N/A 4091 N/A N/A FSINCOS (LG2) | N/A N/A 5640 N/A N/A FSINCOS (L2T) | N/A N/A 5405 N/A N/A FPTAN (0.0) | 41 58 752 8381 2324 FPTAN (1.0) | 581 582 6366 10817 29824 FPTAN (PI) | 606 587 4388 12410 2300 FPTAN (LG2) | 516 513 5939 12502 26770 FPTAN (L2T) | 576 586 5723 12483 2301 FPATAN (0.0) | 41 55 616 1208 10578 FPATAN (1.0) | 736 736 1426 13446 34208 FPATAN (PI) | 206 207 12835 13305 46903 FPATAN (LG2) | 756 736 12490 13319 41312 FPATAN (L2T) | 206 204 12922 13364 50149 F2XM1 (0.0) | 16 56 563 723 1722 F2XM1 (LN2) | 631 624 4178 11070 33823 F2XM1 (LG2) | 611 585 4798 11116 32163 FYL2X (1.0) | 56 57 961 1214 4327 FYL2X (PI) | 946 961 8987 12858 40148 FYL2X (LG2) | 1081 1038 8933 12748 46821 FYL2X (L2T) | 926 886 8982 12712 38986 FYL2XP1 (LG2) | 1026 1037 10485 11867 44708 The Weitek 3167 and 4167 processors only implement the basic arithmetic functions (add, subtract, multiply, divide, square root) in hardware. Transcendental functions are implemented by means of a software library supplied by Weitek that uses the Weitek hardware to approximate the transcendental functions with polynomial and rational approximations. The clock cycle timings for the transcendental functions are average values, since execution time differs with the value of argument. The speed of transcendental functions for the 4167 is estimated based on the numbers in [31,33], from which this timing information has been extracted. 
Execution time for floating-point operations in clock cycles
on Weitek coprocessors

               Single Precision     Double Precision
                 3167     4167        3167     4167

ABS                 3        2           3        2
NEG                 6        2           6        2
ADD                 6        2           6        2
SUB                 6        2           6        2
SUBR                6        2           6        2
MUL                 6        2          10        3
DIVR               38       17          66       31
SQRT               60       17         118       31
SIN               146      ~50         292     ~100
COS               140      ~50         285     ~100
TAN               188      ~60         340     ~110
EXP               179      ~60         401     ~130
LOG               171      ~60         365     ~120
F->ASCII         1000      N/A        1700      N/A   //
ASCII->F         1100      N/A        1800      N/A   //

// rough average of the timings given for different numeric formats by
   Weitek. Note that these conversion routines do much more work than the
   FBLD and FBSTP instructions provided by the 80x87 coprocessors. FBLD
   and FBSTP are useful for conversion routines, but quite a bit of
   additional code is needed for this purpose.


Accuracy

The IEEE-754 Standard for Binary Floating-Point Arithmetic [10,11] is fully
implemented by Intel's 387 coprocessor [17]. Among other things, this means
that the add, subtract, multiply, divide, remainder, and square root
operations always deliver the 'exact' result. By exact it is meant that the
coprocessor always delivers the machine number closest to the real result,
which may not be representable exactly in the available numeric format. The
80387 implements the single, double, and double extended formats as
specified in the standard as well as all functions required by it [17].

Note that earlier Intel coprocessors (the 8087 and the 80287) comply with a
draft version of the standard that differs from the final version. These
chips came out before the IEEE-754 standard was finally accepted in 1985.
As in the 80387, the basic arithmetic in the 8087 and the 80287 is exact in
the sense that the computed result is always the machine number closest to
the real result. However, there are some differences regarding certain
operands like infinities, and some operations like the remainder are
defined differently. Some instructions have been added in the 80387, most
notably the FSIN and FCOS operations.
The argument range for some transcendental functions has been extended [17].

Note that the IEEE-754 standard says nothing about the quality of the
implementation of transcendental functions like sin, cos, tan, arctan, log.
Intel uses a modified CORDIC [18,19] technique to compute the
transcendental functions. Intel claims that the maximum error in the 8087,
80287, and 80387 for all transcendental functions does not exceed two bits
in the mantissa of the double extended format, which features 64 mantissa
bits for an accuracy of approximately 19 decimal places [22,23]. This claim
has been independently verified by a competing vendor [13]. This means that
at least 62 of the 64 mantissa bits in a transcendental function result are
correct.

The Weitek Abacus 3167 and 4167 are 'mostly compatible' with IEEE-754
[31,32,33]. They support the single precision and double precision numeric
formats described in the standard as well as the four rounding modes
required by it. However, due to the need for extremely high speed
operation, some of the finer points of IEEE-754 have not been implemented.
One of the most notable omissions is the missing support for denormal
numbers. Denormals are always flushed to zero.

The 387 clone makers claim 100% compatibility with Intel's 80387, so one
would expect the same accuracy from their chips. For example, the packaging
of the IIT 3C87 says that ".. the requirements of ANSI/IEEE standards are
fulfilled and exceeded". Cyrix states that their 83D87 complies fully with
the IEEE-754 standard [12].

Cyrix ships some diagnostic software with its coprocessors. This includes
the program IEEETEST, which is based on the IEEE test vectors from the
Ph.D. thesis of Jerome T. Coonen [9]. A test using the IEEE test vectors
has also been included in the RUNDIAG program on the Intel RapidCAD
diagnostic disk. Rather than performing random tests, the test vectors
check specific cases that may be hard to get right.
Each test vector specifies the operation to be performed, the operands,
the precision and rounding mode to be used, and the result (including the
flags set) to be expected according to IEEE-754.

I ran IEEETEST on all the available coprocessors/FPUs. The Intel 486, Intel
RapidCAD, Intel 387, Intel 387DX, Cyrix 83D87, and the Cyrix 387+ passed
with no errors. The ULSI 83C87 showed some minor flaws in the FCOM, FDIV,
FMUL, and FSCALE operations, getting flag errors in about 1% of the tested
cases, but no computational errors. However, for the IIT 3C87, the IEEETEST
program showed flag *and* some computational errors (that is, wrong
results) for all tested operations except FXTRACT and FCHS. The Intel 80287
shows numerous errors, but this is not surprising, since the 80287 does not
comply with IEEE-754 but with an earlier draft of that standard, so it does
some things differently than required by the final version of the standard.

Although IEEETEST is written in Turbo Pascal, the coprocessor emulator in
the TP 6.0 library could not be tested, since IEEETEST was compiled with
the $E- switch, which excludes the emulator from the program code. The
public domain emulator EM87 could be tested, but hung in the last test,
which checks the implementation of the remainder operation. This is
probably caused by some bug in the emulation of the FPREM instruction
exercised by this test. It is interesting to note that the error profile of
EM87 matches exactly that of the Intel 80287, so it can be assumed that
EM87 is a very good emulation of the 80287. The Franke387 V2.4 emulator
hung in the division test quite early in IEEETEST. The tests performed up
to the division test reported several errors.

Explanatory text printed at the start of the IEEETEST program:

JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his activities as
a member of the floating-point working group that defined the IEEE 754-1985
Standard for Binary Floating-Point Arithmetic.
Appendix C of his thesis presents FPTEST, a Pascal program written by
J Thomas and JT Coonen. IEEETEST is a port of FPTEST and runs on PCs whose
math coprocessor accepts 80387 compatible floating-point instructions.

IEEETEST reads test vectors from the file TESTVECS and compares the answer
returned by the math coprocessor with the answer listed in the test vector.
If these answers differ an 'F' is displayed, otherwise a '.' is displayed.
Answers can differ due to two types of failures: numeric failures or flag
failures. Numeric failures occur when the computed answer has the wrong
value. Flag failures occur when the status (invalid operation, divide by
zero, underflow, overflow, inexact) is incorrectly identified.

TESTVECS is the concatenation of unmodified versions of all the test
vectors distributed by UC Berkeley. The test data base is copyrighted by
UC Berkeley (1985) and is being distributed with their permission. FPTEST
and the test data base can be obtained by asking for 'IEEE-754 Test Vector'
from UC Berkeley, Electrical Engineering and Computer Science, Industrial
Liaison Program, 479 Corey Hall, Berkeley, CA, 94720 (415)643-6687.

The initial version of this test data base for the proposed IEEE 754 binary
floating-point standard (draft 8.0) was developed for Zilog, Inc. and was
donated to the floating-point working group for dissemination. Errors in or
additions to the distributed data base should be reported to the agency of
distribution, with copies to Zilog, Inc., 1315 Dell Avenue, Campbell, CA,
95008.
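
The distinction between numeric failures and flag failures can be
illustrated with a small sketch. It does not read the TESTVECS format and
does not exercise a coprocessor. As an assumption made purely for
illustration, it uses Python's decimal module, whose arithmetic contexts
record status flags comparable to the IEEE exception flags, as a stand-in
for the unit under test. The names TestVector and classify, and the vector
layout, are hypothetical.

```python
from dataclasses import dataclass
from decimal import (Context, Decimal, DivisionByZero, Inexact,
                     InvalidOperation, Overflow, Underflow)

# The five status flags named in the IEEETEST description.
FLAGS = (InvalidOperation, DivisionByZero, Underflow, Overflow, Inexact)

@dataclass
class TestVector:          # hypothetical layout, NOT the TESTVECS format
    op: str                # '+', '-', '*' or '/'
    a: str                 # first operand
    b: str                 # second operand
    result: str            # expected result
    flags: frozenset       # expected subset of FLAGS

def classify(vec, ctx):
    """Return '.', 'numeric' or 'flag' for one test vector."""
    ops = {'+': ctx.add, '-': ctx.subtract,
           '*': ctx.multiply, '/': ctx.divide}
    ctx.clear_flags()
    got = ops[vec.op](Decimal(vec.a), Decimal(vec.b))
    raised = frozenset(f for f in FLAGS if ctx.flags[f])
    # A wrong value is a numeric failure, regardless of the flags.
    if got.compare_total(Decimal(vec.result)) != 0:
        return 'numeric'
    # Right value but wrong status is a flag failure.
    if raised != vec.flags:
        return 'flag'
    return '.'

# Seven significant digits, no traps: exceptions only set flags.
ctx = Context(prec=7, traps=[])
vectors = [
    TestVector('/', '1', '3', '0.3333333', frozenset({Inexact})),
    TestVector('/', '1', '0', 'Infinity', frozenset({DivisionByZero})),
    TestVector('+', '1', '2', '3', frozenset({Inexact})),  # bad flag entry
    TestVector('*', '2', '3', '7', frozenset()),           # bad value entry
]
results = [classify(v, ctx) for v in vectors]
print(results)
```

The last two vectors are deliberately wrong, so that one flag failure and
one numeric failure show up. A real test vector additionally selects the
precision and the rounding mode, which this sketch leaves fixed.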

IEEETEST output for Intel 80387, Intel 387DX, Intel 486, Cyrix 83D87,
Cyrix 387+, RapidCAD

IEEE-754 Test Vector
Precisions: S=Single D=Double E=Double Extended

                     |     TESTS     |     numeric  TYPE OF FAILURE   flag
Operation       Code | Passed Failed |     S     D     E |     S     D     E
----------------------------------------------------------------------------
Absolute Value   A   |    216      0 |     0     0     0 |     0     0     0
Addition         +   |   3528      0 |     0     0     0 |     0     0     0
Comparison       C   |   4320      0 |     0     0     0 |     0     0     0
Copy Sign        @   |   1488      0 |     0     0     0 |     0     0     0
Division         /   |   4311      0 |     0     0     0 |     0     0     0
Fraction Part    F   |    624      0 |     0     0     0 |     0     0     0
Logb             L   |    960      0 |     0     0     0 |     0     0     0
Multiplication   *   |   3978      0 |     0     0     0 |     0     0     0
Negation         -   |    216      0 |     0     0     0 |     0     0     0
Next After       N   |   2832      0 |     0     0     0 |     0     0     0
Round to Integer I   |    558      0 |     0     0     0 |     0     0     0
Scalb            S   |    948      0 |     0     0     0 |     0     0     0
Square Root      V   |    744      0 |     0     0     0 |     0     0     0
Subtraction      -   |   3528      0 |     0     0     0 |     0     0     0
Remainder        %   |   2984      0 |     0     0     0 |     0     0     0
Totals               |  31235      0 |


IEEETEST output for ULSI 83C87

IEEE-754 Test Vector
Precisions: S=Single D=Double E=Double Extended

                     |     TESTS     |     numeric  TYPE OF FAILURE   flag
Operation       Code | Passed Failed |     S     D     E |     S     D     E
----------------------------------------------------------------------------
Absolute Value   A   |    216      0 |     0     0     0 |     0     0     0
Addition         +   |   3528      0 |     0     0     0 |     0     0     0
Comparison       C   |   4312      8 |     0     0     0 |     0     0     8
Copy Sign        @   |   1488      0 |     0     0     0 |     0     0     0
Division         /   |   4250     61 |     0     0     0 |    28    28     5
Fraction Part    F   |    624      0 |     0     0     0 |     0     0     0
Logb             L   |    960      0 |     0     0     0 |     0     0     0
Multiplication   *   |   3936     42 |     0     0     0 |    19    19     4
Negation         -   |    216      0 |     0     0     0 |     0     0     0
Next After       N   |   2828      4 |     0     0     0 |     0     0     4
Round to Integer I   |    558      0 |     0     0     0 |     0     0     0
Scalb            S   |    930     18 |     0     0     0 |     6     6     6
Square Root      V   |    744      0 |     0     0     0 |     0     0     0
Subtraction      -   |   3528      0 |     0     0     0 |     0     0     0
Remainder        %   |   2984      0 |     0     0     0 |     0     0     0
Totals               |  31102    133 |


IEEETEST output for IIT 3C87

IEEE-754 Test Vector
Precisions: S=Single D=Double E=Double Extended

                     |     TESTS     |     numeric  TYPE OF FAILURE   flag
Operation       Code | Passed Failed |     S     D     E |     S     D     E
----------------------------------------------------------------------------
Absolute Value   A   |    200     16 |     0     0    16 |     0     0     0
Addition         +   |   3336    192 |     0     0   128 |     0     0    96
Comparison       C   |   4224     96 |     0     0    96 |     0     0     0
Copy Sign        @   |   1488      0 |     0     0     0 |     0     0     0
Division         /   |   4159    152 |     0     0   124 |     0     0   116
Fraction Part    F   |    600     24 |     0     0    24 |     0     0    24
Logb             L   |    960      0 |     0     0     0 |     0     0     0
Multiplication   *   |   3702    276 |     0     0   248 |     0     0   100
Negation         -   |    200     16 |     0     0    16 |     0     0     0
Next After       N   |   2248    584 |     0     0   584 |     0     0   168
Round to Integer I   |    542     16 |     0     0     4 |     0     0    16
Scalb            S   |    874     74 |     5     5    44 |     8     8    20
Square Root      V   |    688     56 |     0     0    56 |     0     0    56
Subtraction      -   |   3336    192 |     0     0   128 |     0     0    96
Remainder        %   |   2844    140 |     0     0   140 |     0     0   116
Totals               |  29401   1834 |


IEEETEST output for Intel 80287 run together with a 80386 CPU

IEEE-754 Test Vector
Precisions: S=Single D=Double E=Double Extended

                     |     TESTS     |     numeric  TYPE OF FAILURE   flag
Operation       Code | Passed Failed |     S     D     E |     S     D     E
----------------------------------------------------------------------------
Absolute Value   A   |    216      0 |     0     0     0 |     0     0     0
Addition         +   |   2886    642 |    16    16   112 |   174   174   174
Comparison       C   |      0   4320 |  1324  1324  1324 |  1332  1332  1332
Copy Sign        @   |   1488      0 |     0     0     0 |     0     0     0
Division         /   |   3777    534 |    18    18    37 |   169   169   165
Fraction Part    F   |    552     72 |    24    24    24 |    24    24    24
Logb             L   |    900     60 |    12    12    12 |    20    20    20
Multiplication   *   |   2944   1034 |   105   105   197 |   303   303   231
Negation         -   |    216      0 |     0     0     0 |     0     0     0
Next After       N   |    348   2484 |   768   768   768 |   504   504   526
Round to Integer I   |    546     12 |     0     0     0 |     4     4     4
Scalb            S   |    663    285 |    45    43    26 |   102    98    46
Square Root      V   |    720     24 |     4     4     4 |     8     8     8
Subtraction      -   |   2886    642 |    16    16   112 |   174   174   174
Remainder        %   |    708   2276 |   768   768   560 |   216   216   216
Totals               |  18850  12385 |


IEEETEST output for EM87 coprocessor emulator run on a Intel 386 CPU

IEEE-754 Test Vector
Precisions: S=Single D=Double E=Double Extended

                     |     TESTS     |     numeric  TYPE OF FAILURE   flag
Operation       Code | Passed Failed |     S     D     E |     S     D     E
----------------------------------------------------------------------------
Absolute Value   A   |    216      0 |     0     0     0 |     0     0     0
Addition         +   |   2886    642 |    16    16   112 |   174   174   174
Comparison       C   |      0   4320 |  1324  1324  1324 |  1332  1332  1332
Copy Sign        @   |   1488      0 |     0     0     0 |     0     0     0
Division         /   |   3777    534 |    18    18    37 |   169   169   165
Fraction Part    F   |    552     72 |    24    24    24 |    24    24    24
Logb             L   |    900     60 |    12    12    12 |    20    20    20
Multiplication   *   |   2944   1034 |   105   105   197 |   303   303   231
Negation         -   |    216      0 |     0     0     0 |     0     0     0
Next After       N   |    348   2484 |   768   768   768 |   504   504   526
Round to Integer I   |    546     12 |     0     0     0 |     4     4     4
Scalb            S   |    663    285 |    45    43    26 |   102    98    46
Square Root      V   |    720     24 |     4     4     4 |     8     8     8
Subtraction      -   |   2886    642 |    16    16   112 |   174   174   174


To complement the checks done by IEEETEST I wrote some short programs
DENORMTS, RCTRL, PCTRL in Turbo Pascal 6.0 that test the following
features:

1. support for denormals in all precisions (single, double, extended)
2. support for the four IEEE rounding modes (up, down, nearest, chop)
3. support for precision control

Note that 1) and 2) are required for IEEE conformance, while 3) is required
for compatibility with Intel's coprocessors. Precision control forces the
results of the FADD, FSUB, FMUL, FDIV, and FSQRT instructions to be rounded
to the specified precision (single, double, double extended). This feature
is provided to obtain compatibility with certain programming languages
[17]. By specifying lower precision, one effectively nullifies the
advantages of extended precision intermediate results.

The programs that test precision control and rounding control are designed
to return a different result for each of the modes for the same sequence of
operations. The source code of the programs can be found in appendix A.

The Intel 8087 and 80287 were not tested with DENORMTS, since Turbo Pascal
does not support extended precision denormals on 8087/80287 processors, so
the denormal test fails anyway. The 8087 and 287 pass the RCTRL and PCTRL
tests, though.
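
Two of the properties these programs probe can be illustrated on any
machine that uses the IEEE-754 single and double formats. The following
sketch is not a port of DENORMTS or PCTRL (their source is in appendix A).
It is a minimal Python illustration, assuming that Python floats are IEEE
doubles; the helper names are made up for this example. It classifies a
value as a denormal by inspecting the exponent field of its bit pattern,
and it mimics the effect of rounding a result to single precision by a
round trip through the 32-bit format.

```python
import struct

def is_double_denormal(x: float) -> bool:
    """True if x is a nonzero double whose biased exponent field is zero."""
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & ((1 << 52) - 1)
    return exponent == 0 and mantissa != 0

def is_single_denormal(x: float) -> bool:
    """True if x, rounded to single precision, is a nonzero denormal."""
    bits = struct.unpack('<I', struct.pack('<f', x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & ((1 << 23) - 1)
    return exponent == 0 and mantissa != 0

def round_to_single(x: float) -> float:
    """Round a double to the nearest single precision number."""
    return struct.unpack('<f', struct.pack('<f', x))[0]

# Magnitudes similar to those in the test output: below the smallest
# normal single (about 1.18E-38) and the smallest normal double
# (about 2.23E-308), respectively.
single_ok = is_single_denormal(4.60943e-41)
double_ok = is_double_denormal(8.75e-311)
normal_is_denormal = is_double_denormal(1.0)

# Rounding to a shorter format discards digits of the result.
third = 1.0 / 3.0
unchanged = round_to_single(third) == third

print(single_ok, double_ok, normal_is_denormal, unchanged)
```

On a system that flushes denormals to zero, the mantissa test would fail
because the stored value would be zero. The extended format and the
hardware precision control field are not reachable from this sketch.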

These are the results for the Intel 387, Intel 387DX, Intel 486, Intel
RapidCAD, Cyrix 83D87, Cyrix 387+, and the EM87 emulator (on a 80386
machine)

Precision Control
  SINGLE     1.13311278820037842E+0000
  DOUBLE     1.23456789006442125E+0000
  EXTENDED   1.23456789012337585E+0000

Rounding Control
  NEAREST   -1.23427629010100635E+0100
  DOWN      -1.23427623555772409E+0100
  UP        -1.23457760966801097E+0100
  CHOP      -1.23397493540770643E+0100

Denormal support
  SINGLE denormals supported
  SINGLE denormal prints as:       4.60943116855005E-0041
  Denormal should be printed as    4.60943...E-0041
  DOUBLE denormals supported
  DOUBLE denormal prints as:       8.75000000000016E-0311
  Denormal should be printed as    8.75...E-0311
  EXTENDED denormals supported
  EXTENDED denormal prints as:     1.31640625000000E-4934
  Denormal should be printed as    1.3164...E-4934


These are the results for the ULSI 83C87

Precision Control
  SINGLE     1.23456789012337585E+0000
  DOUBLE     1.23456789012337585E+0000
  EXTENDED   1.23456789012337585E+0000

Rounding Control
  NEAREST   -1.23427629010100635E+0100
  DOWN      -1.23427623555772409E+0100
  UP        -1.23457760966801097E+0100
  CHOP      -1.23397493540770643E+0100

Denormal support
  SINGLE denormals supported
  SINGLE denormal prints as:       4.60943116855005E-0041
  Denormal should be printed as    4.60943...E-0041
  DOUBLE denormals supported
  DOUBLE denormal prints as:       8.75000000000016E-0311
  Denormal should be printed as    8.75...E-0311
  EXTENDED denormals supported
  EXTENDED denormal prints as:     1.31640625000000E-4934
  Denormal should be printed as    1.3164...E-4934


These are the results for the IIT 3C87

Precision Control
  SINGLE     1.13311278820037842E+0000
  DOUBLE     1.23456789006442125E+0000
  EXTENDED   1.23456789012337585E+0000

Rounding Control
  NEAREST   -1.23427629010100635E+0100
  DOWN      -1.23427623555772409E+0100
  UP        -1.23457760966801097E+0100
  CHOP      -1.23397493540770643E+0100

Denormal support
  SINGLE denormals supported
  SINGLE denormal prints as:       4.60943116855005E-0041
  Denormal should be printed as    4.60943...E-0041
  DOUBLE denormals supported
  DOUBLE denormal prints as:       8.75000000000016E-0311
  Denormal should be printed as    8.75...E-0311
  EXTENDED denormals not supported


These are the results for the TP 6.0 coprocessor emulator:

Precision Control
  SINGLE     1.23456789012351396E+0000
  DOUBLE     1.23456789012351396E+0000
  EXTENDED   1.23456789012351396E+0000

Rounding Control
  NEAREST   -1.23457766383395931E+0100
  DOWN      -1.23457766383395931E+0100
  UP        -1.23457766383395931E+0100
  CHOP      -1.23457766383395931E+0100

Denormal support
  SINGLE denormals not supported
  DOUBLE denormals not supported
  EXTENDED denormals not supported


The test results show that the IIT 3C87 does not conform to the IEEE-754
floating-point standard in that it does not support denormals in double
extended precision. The ULSI 83C87 is not Intel 387 compatible in that it
does not support precision control, but always uses double extended
precision. The TP 6.0 emulator supports neither precision control nor
rounding control, and it does not support denormals in any precision. In
addition, its basic arithmetic operations do not seem to conform to the
IEEE standard, as the results of the test programs differ from those
computed by any coprocessor in any mode.

With regard to the accuracy of transcendental functions, Cyrix claims that
the relative error of the transcendental functions on the 83D87 never
exceeds 0.5 units in the last place (0.5 ULP) of the double extended format
[13]. This means that the maximum relative error is below 2**-64, while
Intel's published error limit is 2**-62. While Intel uses a modified CORDIC
algorithm [18,19] to compute the transcendental functions, Cyrix uses
rational approximations that utilize a very fast array multiplier. For an
explanation why this approach is superior to CORDIC with today's
technology, see [61]. Also, Cyrix uses an internal 75 bit data path for the
mantissa [15], so intermediate computations in the generation of
transcendental function values will enjoy some additional accuracy over the
64 bits provided by the double extended format.
Using 75 mantissa bits also provides an advantage over other coprocessors
like the Intel 387DX and ULSI 83C87, which use only a 68 bit data path for
the mantissa [58,59].

Note that a maximum relative error of 0.5 ULP for the Cyrix coprocessor
does not mean that it returns the 'exact' result (the machine number
closest to the infinitely precise result) all the time. Just consider the
case where the infinitely precise result of a transcendental function falls
nearly half way between two machine numbers. A relative error of 0.5 ULP
can cause the result to be either of the numbers after rounding, depending
on the direction of the error. But the 83D87 should deliver results that
never differ from the 'exact' result by more than one ULP.

Cyrix also claims that its transcendental functions satisfy the
monotonicity criterion [13], a claim not made by any of the competitors.
Monotonicity means that for all x1 > x2, it always follows that
f(x1) >= f(x2) for an increasing function like sin on [0..pi/4]. Likewise,
for a decreasing function like cos on [0..pi/4], for all x1 > x2, it
follows that f(x1) <= f(x2).

The Weitek Abacus 3167 and 4167 implement only the basic arithmetic
operations (add, subtract, negate, multiply, divide, square root) in
hardware. Transcendental functions are provided via a software library
supplied by Weitek. For these library functions Weitek claims a maximum
relative error of 5 ULPs [31,33] (ULP = Unit in the Last Place, the numeric
weight of the least significant mantissa bit). This means that the last
three bits in the mantissa of a double precision result can be wrong. Note
that the Intel 387 and compatible math coprocessors generate the
transcendental functions with a small relative error with regard to the
_extended double precision_ format. Thus, when rounded to double precision,
their function values are nearly always 'exact'. 387 type coprocessors have
superior accuracy when compared with Weitek's coprocessors.
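
Error figures in ULPs can be measured by comparing a computed value with a
more precise reference. The sketch below is a minimal Python illustration
of that idea for the double precision format. It is not the method used by
any of the vendors: as an assumption for this example, the reference is a
truncated Taylor series of the sine evaluated in exact rational arithmetic,
which is adequate only for small arguments such as those in [0, pi/4]. It
needs Python 3.9 or later for math.ulp, and the function names are
hypothetical.

```python
import math
from fractions import Fraction

def sin_reference(x: float, terms: int = 20) -> Fraction:
    """Taylor series of sin(x), summed exactly with rational numbers."""
    xr = Fraction(x)                 # exact value of the double x
    term = xr                        # first term of the series: x
    total = xr
    for k in range(1, terms):
        # next term: previous * (-x^2) / ((2k) * (2k + 1))
        term = -term * xr * xr / ((2 * k) * (2 * k + 1))
        total += term
    return total

def ulp_error(computed: float, reference: Fraction) -> float:
    """Distance between computed and reference in ULPs of computed."""
    ulp = Fraction(math.ulp(computed))
    return float(abs(Fraction(computed) - reference) / ulp)

# Sample the library sine on (0, pi/4] and record the worst deviation.
samples = [i / 100 * math.pi / 4 for i in range(1, 101)]
worst = max(ulp_error(math.sin(x), sin_reference(x)) for x in samples)
print("largest observed error in ULPs:", worst)

# Number of low-order mantissa bits needed to hold an error of n ULPs,
# e.g. 5 ULPs occupy three bits, 3 ULPs occupy two bits.
bits_for_5_ulps = (5).bit_length()
bits_for_3_ulps = (3).bit_length()
```

The observed figure depends on the math library behind math.sin, so no
particular value is claimed here. Monotonicity could be checked in the
same style by comparing the function values of neighbouring arguments.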

The test diskette distributed with early versions of the Cyrix 83D87
contained a program TRANCK that checks the accuracy of the transcendental
functions in the coprocessor against a more precise software arithmetic
[16]. I used this program to compare the accuracy of the transcendental
functions on those 287/387/486 coprocessors/FPUs available to me. As TRANCK
will not accept negative numbers as interval limits, I tested each function
on an interval along the positive x-axis. The functions tested are F2XM1
(2**x-1), FSIN (sine), FCOS (cosine), FPTAN (tangent), FPATAN (arctangent),
FYL2X (y * log2 (x)), and FYL2XP1 (y * log2 (x+1)). These are all the
transcendental functions implemented on the 80387. Note that the square
root (FSQRT) is *not* a transcendental function. For every function,
100,000 arguments were evaluated. The arguments were uniformly distributed
within the interval tested. The EM87 emulator could not be checked with
TRANCK, since the multiple precision package in TRANCK would always return
with an error message immediately. However, the Franke387 could be tested.

Test results for accuracy of transcendental functions for double extended
precision as returned by the program TRANCK. 100,000 trials per function.

%wrong is the percentage of results that differ from the 'exact' result
(infinitely precise result rounded to 64 bits).

ULP_hi is the number of results where the returned result was greater than
the 'exact' (correctly rounded) result by one ULP (the numeric weight of
the last mantissa bit, 2**-64 to 2**-63 depending on the size of the
number).

ULPs_hi is the number of results where the returned result was greater than
the 'exact' result by two or more ULPs.

ULP_lo is the number of results where the returned result was smaller than
the 'exact' (correctly rounded) result by one ULP (the numeric weight of
the last mantissa bit, 2**-64 to 2**-63 depending on the size of the
number).

ULPs_lo is the number of results where the returned result was smaller than
the 'exact' result by two or more ULPs.

max ULP err is the maximum deviation of a returned result from the 'exact'
answer expressed in ULPs.


Franke387 V2.4 emulator                                           max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        39.042   25301     708   13029       4        2
COS     0,pi/4        75.714   49827   25887       0       0        3
TAN     0,pi/4        76.976   14230   10029   24323   28394        9
ATAN    0,1           55.826   26028    1529   24044    4225        4
2XM1    0,0.5         96.717       0       0   47910   48807        5
YL2XP1  0,sqrt(2)-1   93.007     578       9   27416   65004        8
YL2X    0.1,10        62.252   16817    4712   37082    3641     2953


INTEL 80287                                                       max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4           N/A     N/A     N/A     N/A     N/A      N/A
COS     0,pi/4           N/A     N/A     N/A     N/A     N/A      N/A
TAN     0,pi/4        37.001   18756     524   17405     316        2
ATAN    0,1            9.666    6065       0    3601       0        1
2XM1    0,0.5         19.920       0       0   19920       0        1
YL2XP1  0,sqrt(2)-1    7.780     868       0    6912       0        1
YL2X    0.1,10         1.287     723       0     564       0        1


INTEL 387                                                         max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        28.872    2467       0   26392      13        2
COS     0,pi/4        27.213   27169      35       9       0        2
TAN     0,pi/4        10.532     441       0   10091       0        1
ATAN    0,1            7.088    2386       0    4691       1        2
2XM1    0,0.5         32.024       0       0   32024       0        1
YL2XP1  0,sqrt(2)-1   22.611    3461       0   19150       0        1
YL2X    0.1,10        13.020    6508       0    6512       0        1


INTEL 387DX                                                       max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        28.873    2467       0   26393      13        2
COS     0,pi/4        27.121   27090      22       9       0        2
TAN     0,pi/4        10.711     457       0   10254       0        1
ATAN    0,1            7.088    2386       0    4691       1        2
2XM1    0,0.5         32.024       0       0   32024       0        1
YL2XP1  0,sqrt(2)-1   22.611    3461       0   19150       0        1
YL2X    0.1,10        13.020    6508       0    6512       0        1


ULSI 83C87                                                        max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        35.530    4989       6   30238     297        2
COS     0,pi/4        43.989   11193     675   31393     728        2
TAN     0,pi/4        48.539   18880    1015   26349    2295        3
ATAN    0,1           20.858      62       0   20796       0        1
2XM1    0,0.5         21.257       4       0   21253       0        1
YL2XP1  0,sqrt(2)-1   27.893    9446       0   18213     234        2
YL2X    0.1,10        13.603    9816       0    3787       0        1


IIT 3C87                                                          max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        18.650   11171       0    7479       0        1
COS     0,pi/4         7.700    3024       0    4676       0        1
TAN     0,pi/4        20.973    9681       0   11291       1        2
ATAN    0,1           19.280   13186       0    6094       0        1
2XM1    0,0.5         25.660   17570       0    8090       0        1
YL2XP1  0,sqrt(2)-1   45.830   23503    1896   19654     777        3
YL2X    0.1,10        10.888    5638     357    4845      48        3


CYRIX 83D87                                                       max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4         1.554    1015       0     539       0        1
COS     0,pi/4         0.925     143       0     782       0        1
TAN     0,pi/4         4.147     881       0    3266       0        1
ATAN    0,1            0.656     229       0     427       0        1
2XM1    0,0.5          2.628    1433       0    1194       0        1
YL2XP1  0,sqrt(2)-1    3.242     825       0    2417       0        1
YL2X    0.1,10         0.931     256       0     675       0        1


CYRIX 387+                                                        max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4         1.486     864       0     622       0        1
COS     0,pi/4         2.072      12       0    2060       0        1
TAN     0,pi/4         0.602      63       0     539       0        1
ATAN    0,1            0.384      12       0     372       0        1
2XM1    0,0.5          1.985      27       0    1958       0        1
YL2XP1  0,sqrt(2)-1    3.662    1705       0    1957       0        1
YL2X    0.1,10         0.764     367       0     397       0        1


INTEL RapidCAD, Intel 486                                         max
funct.  interval      %wrong  ULP_hi ULPs_hi  ULP_lo ULPs_lo  ULP err

SIN     0,pi/4        16.991    1517       0   15474       0        1
COS     0,pi/4         9.003    7603       0    1400       0        1
TAN     0,pi/4        10.532     441       0   10091       0        1
ATAN    0,1            7.078    2386       0    4691       1        2
2XM1    0,0.5         32.025       0       0   32025       0        1
YL2XP1  0,sqrt(2)-1   21.800     533       0   21267       0        1
YL2X    0.1,10         3.894    1879       0    2015       0        1


The test results above indicate that none of the 80x87 compatibles exceeds
Intel's stated error bound of 3 ULPs for the transcendental functions.
However, some coprocessors are more accurate than others. Rating the
coprocessors according to the accuracy of their transcendental functions
gives the following list (highest accuracy first): Cyrix 387+, Cyrix 83D87,
Intel 486, Intel RapidCAD, Intel 80287(!), Intel 387DX, Intel 80387,
IIT 3C87, ULSI 83C87. The tests also show that the problems with excessive
inaccuracy of the transcendental functions in early versions of the IIT
coprocessors, with errors of up to 8 ULPs [8], have been eliminated.
According to [56], certain problems with the FPATAN instruction on the
IIT 3C87 occurring under the UNIX version of AutoCAD were corrected in
June 1990.

The Franke387 has acceptable accuracy for the FSIN, FCOS, and FPATAN
instructions, taking into consideration that according to its
documentation, Franke387 uses only 64 bits of precision for the
intermediate results, while coprocessors typically use 68 bits or more.
However, the larger error in the FPTAN, F2XM1, FYL2XP1, and especially
the FYL2X operations shows that the emulator doesn't use state-of-the-art
algorithms, which ensure an error of only a very few ULPs even if no
extra-precise intermediate results are available.


References

[1]  Schnurer, G.: Zahlenknacker im Vormarsch.
     c't 1992, Heft 4, Seiten 170-186
[2]  Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark.
     Computer Journal, Vol. 19, No. 1, 1976, pp. 43-49
[3]  Wichmann, B.A.: Validation code for the Whetstone benchmark.
     NPL Report DITC 107/88, National Physical Laboratory, UK,
     March 1988
[4]  Curnow, H.J.: Wither Whetstone? The Synthetic Benchmark after
     15 Years. In: Aad van der Steen (ed.): Evaluating Supercomputers.
     London: Chapman and Hall 1990
[5]  Dongarra, J.J.: The Linpack Benchmark: An Explanation.
     In: Aad van der Steen (ed.): Evaluating Supercomputers.
     London: Chapman and Hall 1990
[6]  Dongarra, J.J.: Performance of Various Computers Using Standard
     Linear Equations Software. Report CS-89-85, Computer Science
     Department, University of Tennessee, March 11, 1992
[7]  Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test.
     Design & Elektronik 1990, Heft 13, Seiten 105-110
[8]  Ungerer, B.: Sockelfolger.
     c't 1990, Heft 4, Seiten 162-163
[9]  Coonen, J.T.: Contributions to a Proposed Standard for Binary
     Floating-Point Arithmetic. Ph.D. thesis, University of California,
     Berkeley, 1984
[10] IEEE: IEEE Standard for Binary Floating-Point Arithmetic.
     SIGPLAN Notices, Vol. 22, No. 2, 1985, pp. 9-25
[11] IEEE Standard for Binary Floating-Point Arithmetic.
     ANSI/IEEE Std 754-1985. New York, NY: Institute of Electrical and
     Electronics Engineers 1985
[12] FasMath 83D87 Compatibility Report.
     Cyrix Corporation, Nov. 1989, Order No. B2004
[13] FasMath 83D87 Accuracy Report.
     Cyrix Corporation, July 1990, Order No. B2002
[14] FasMath 83D87 Benchmark Report.
     Cyrix Corporation, June 1990, Order No. B2004
[15] FasMath 83D87 User's Manual.
     Cyrix Corporation, June 1990, Order No. L2001-003
[16] Brent, R.P.: A FORTRAN multiple-precision arithmetic package.
     ACM Transactions on Mathematical Software, Vol. 4, No. 1,
     March 1978, pp. 57-70
[17] 387DX User's Manual, Programmer's Reference.
     Intel Corporation, 1989, Order No. 231917-002
[18] Volder, J.E.: The CORDIC Trigonometric Computing Technique.
     IRE Transactions on Electronic Computers, Vol. EC-8, No. 5,
     September 1959, pp. 330-334
[19] Walther, J.S.: A unified algorithm for elementary functions.
     AFIPS Conference Proceedings, Vol. 38, SJCC 1971, pp. 379-385
[20] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der
     IBM 3090E mit Vektoreinrichtung. Arbeitsbericht RRZK-8803,
     Regionales Rechenzentrum an der Universität zu Köln, Februar 1988
[21] McMahon, H.H.: The Livermore Fortran Kernels: A test of the
     numerical performance range. Technical Report UCRL-53745,
     Lawrence Livermore National Laboratory, USA, December 1986
[22] Nave, R.: Implementation of Transcendental Functions on a
     Numerics Processor. Microprocessing and Microprogramming,
     Vol. 11, No. 3-4, March-April 1983, pp. 221-225
[23] Yuen, A.K.: Intel's Floating-Point Processors.
     Electro/88 Conference Record, Boston, MA, USA, 10-12 May 1988,
     pp. 48/5-1 - 48/5-7
[24] Stiller, A.; Ungerer, B.: Ausgerechnet.
     c't 1990, Heft 1, Seiten 90-92
[25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase?
     PC Professionell, Juni 1991, Seiten 214-237
[26] Intel 80286 Hardware Reference Manual.
     Intel Corporation, 1987, Order No. 210760-002
[27] AMD 80C287 80-bit CMOS Numeric Processor.
     Advanced Micro Devices, June 1989, Order No.
     11671B/0
[28] Intel RapidCAD(tm) Engineering CoProcessor Performance Brief.
     Intel Corporation, 1992
[29] i486(tm) Microprocessor Performance Report.
     Intel Corporation, April 1990, Order No. 240734-001
[30] Intel486(tm) DX2 Microprocessor Performance Brief.
     Intel Corporation, March 1992, Order No. 241254-001
[31] Abacus 3167 Floating-Point Coprocessor Data Book.
     Weitek Corporation, July 1990, DOC No. 9030
[32] WTL 4167 Floating-Point Coprocessor Data Book.
     Weitek Corporation, July 1989, DOC No. 8943
[33] Abacus Software Designer's Guide.
     Weitek Corporation, September 1989, DOC No. 8967
[34] Stiller, A.: Cache & Carry.
     c't 1992, Heft 6, Seiten 118-130
[35] Stiller, A.: Cache & Carry, Teil 2.
     c't 1992, Heft 7, Seiten 28-34
[36] Palmer, J.F.; Morse, S.P.: Die mathematischen Grundlagen der
     Numerik-Prozessoren 8087/80287. München: tewi 1985
[37] 80C187 80-bit Math Coprocessor Data Sheet.
     Intel Corporation, September 1989, Order No. 270640-003
[38] IIT-2C87 80-bit Numeric Co-Processor Data Sheet.
     IIT, May 1990
[39] Engineering note 4x4 matrix multiply transformation.
     IIT, 1989
[40] Tscheuschner, E.: 4 mal 4 auf einen Streich.
     c't 1990, Heft 3, Seiten 266-276
[41] Goldberg, D.: Computer Arithmetic. In: Hennessy, J.L.;
     Patterson, D.A.: Computer Architecture: A Quantitative Approach.
     San Mateo, CA: Morgan Kaufmann 1990
[42] 8087 Math Coprocessor Data Sheet.
     Intel Corporation, October 1989, Order No. 205835-007
[43] 8086/8088 User's Manual, Programmer's and Hardware Reference.
     Intel Corporation, 1989, Order No. 240487-001
[44] 80286 and 80287 Programmer's Reference Manual.
     Intel Corporation, 1987, Order No. 210498-005
[45] 80287XL/XLT CHMOS III Math Coprocessor Data Sheet.
     Intel Corporation, May 1990, Order No. 290376-001
[46] Cyrix FasMath(tm) 82S87 Coprocessor Data Sheet.
     Cyrix Corporation, 1991, Document 94018-00 Rev. 1.0
[47] IIT-3C87 80-bit Numeric Co-Processor Data Sheet.
     IIT, May 1990
[48] 486(tm)SX(tm) Microprocessor/ 487(tm)SX(tm) Math CoProcessor
     Data Sheet.
     Intel Corporation, April 1991, Order No. 240950-001
[49] Schnurer, G.: Die große Verlade.
     c't 1991, Heft 7, Seiten 55-57
[50] Schnurer, G.: Eine 4 für alle.
     c't 1991, Heft 6, Seite 25
[51] Intel486(tm)DX Microprocessor Data Book.
     Intel Corporation, June 1991, Order No. 240440-004
[52] i486(tm) Microprocessor Hardware Reference Manual.
     Intel Corporation, 1990, Order No. 240552-001
[53] i486(tm) Microprocessor Programmer's Reference Manual.
     Intel Corporation, 1990, Order No. 240486-001
[54] Ungerer, B.: Kalte Hüte.
     c't 1992, Heft 8, Seiten 140-144
[55] Ungerer, B.: Heiße Sache.
     c't 1991, Heft 4, Seiten 104-108
[56] Rosch, W.L.: Handfeste Hilfe oder Seifenblase?
     PC Professionell, Juni 1991, Seiten 214-237
[57] Niederkrüger, W.: Lebendige Vergangenheit.
     c't 1990, Heft 12, Seiten 114-116
[58] ULSI Math*Co Advanced Math Coprocessor Technical Specification.
     ULSI System, 5/92, Rev. E
[59] 387(tm)DX Math CoProcessor Data Sheet.
     Intel Corporation, September 1990, Order No. 240448-003
[60] 387(tm) Numerics Coprocessor Extension Data Sheet.
     Intel Corporation, February 1989, Order No. 231920-005
[61] Koren, I.; Zinaty, O.: Evaluating Elementary Functions in a
     Numerical Coprocessor Based on Rational Approximations.
     IEEE Transactions on Computers, Vol. C-39, No. 8, August 1990,
     pp. 1030-1037
[62] 387(tm) SX Math CoProcessor Data Sheet.
     Intel Corporation, November 1989, Order No. 240225-005
[63] Frenkel, G.: Coprocessors Speed Numeric Operations.
     PC-Week, August 27, 1990
[64] Schnurer, G.; Stiller, A.: Auto-Matt.
     c't 1991, Heft 10, Seiten 94-96
[65] Grehan, R.: FPU Face-Off.
     Byte, November 1990, pp. 194-200
[66] Tang, P.T.P.: Testing Computer Arithmetic by Elementary Number
     Theory. Preprint MCS-P84-0889, Mathematics and Computer Science
     Division, Argonne National Laboratory, August 1989
[67] Ferguson, W.E.: Selecting math coprocessors.
     IEEE Spectrum, July 1991, pp. 38-41
[68] Schnabel, J.: Viermal 387.
     Computer Persönlich 1991, Heft 22, Seiten 153-156
[69] Hofmann, J.: Starke Rechenknechte.
     mc 1990, Heft 7, Seiten 64-67
[70] Woerrlein, H.; Hinnenberg, R.: Die Lust an der Power.
     Computer Live 1991, Heft 10, Seiten 138-149


Manufacturers' addresses

Intel Corporation
3065 Bowers Avenue
Santa Clara, CA 95051
USA

IIT Integrated Information Technology, Inc.
2540 Mission College Blvd.
Santa Clara, CA 95054
USA

ULSI Systems, Inc.
58 Daggett Drive
San Jose, CA 95134
USA

Chips & Technologies, Inc.
3050 Zanker Road
San Jose, CA 95134
USA

Weitek Corporation
1060 East Arques Avenue
Sunnyvale, CA 94086
USA

AMD Advanced Micro Devices, Inc.
901 Thompson Place
P.O.B. 3453
Sunnyvale, CA 94088-3453
USA

Cyrix Corporation
P.O.B. 850118
Richardson, TX 75085
USA


Appendix A

{$N+,E+}

PROGRAM PCtrl;

VAR B,c: EXTENDED;
    Precision, L: WORD;

PROCEDURE SetPrecisionControl (Precision: WORD);
(* This procedure sets the internal precision of the NDP. Available *)
(* precision values: 0 - 24 bits (SINGLE)                           *)
(*                   1 - n.a. (mapped to single)                    *)
(*                   2 - 53 bits (DOUBLE)                           *)
(*                   3 - 64 bits (EXTENDED)                         *)

VAR CtrlWord: WORD;

BEGIN {SetPrecisionControl}
   IF Precision = 1 THEN
      Precision := 0;
   Precision := Precision SHL 8;  { make mask for PC field in ctrl word}
   ASM
      FSTCW    [CtrlWord]         { store NDP control word             }
      MOV      AX, [CtrlWord]     { load control word into CPU         }
      AND      AX, 0FCFFh         { mask out precision control field   }
      OR       AX, [Precision]    { set desired precision in PC field  }
      MOV      [CtrlWord], AX     { store new control word             }
      FLDCW    [CtrlWord]         { set new precision control in NDP   }
   END;
END; {SetPrecisionControl}

BEGIN {main}
   FOR Precision := 1 TO 3 DO BEGIN
      B := 1.2345678901234567890;
      SetPrecisionControl (Precision);
      FOR L := 1 TO 20 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 20 DO BEGIN
         B := B*B;
      END;
      SetPrecisionControl (3);    { full precision for printout }
      WriteLn (Precision, B:28);
   END;
END.
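
For readers without an 80x87 at hand, the effect that PCtrl makes visible
can be imitated in software. The sketch below is an assumption-laden
stand-in, not a port of PCtrl: it is written in Python, it does not touch
the coprocessor control word, and it imitates single precision by rounding
every intermediate result to an IEEE single with the struct module. The
helper names to_single, to_double and sqrt_square_roundtrip are made up
for this illustration, and the 64-bit extended setting is not covered.

```python
import math
import struct


def to_single(x: float) -> float:
    """Round a double to the nearest IEEE single (24-bit significand)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_double(x: float) -> float:
    """Python floats already are IEEE doubles (53-bit significand)."""
    return x


def sqrt_square_roundtrip(start: float, steps: int, rnd) -> float:
    """Take 'steps' square roots, then square 'steps' times.

    Every intermediate result is passed through 'rnd', which plays the
    role of the precision control setting.
    """
    b = rnd(start)
    for _ in range(steps):
        b = rnd(math.sqrt(b))
    for _ in range(steps):
        b = rnd(b * b)
    return b


if __name__ == "__main__":
    for name, rnd in (("single", to_single), ("double", to_double)):
        print(name, sqrt_square_roundtrip(1.2345678901234567890, 20, rnd))
```

After 20 square roots the value is so close to 1 that only a few
significand bits of a single still carry information, so the squaring
loop cannot recover the starting value; with 53 bits the round trip comes
back to within a small relative error.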
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM RCtrl;

VAR B,c: EXTENDED;
    RoundingMode, L: WORD;

PROCEDURE SetRoundingMode (RCMode: WORD);
(* This procedure selects one of four available rounding modes *)
(* 0 - Round to nearest (default)                              *)
(* 1 - Round down (towards negative infinity)                  *)
(* 2 - Round up (towards positive infinity)                    *)
(* 3 - Chop (truncate, round towards zero)                     *)

VAR CtrlWord: WORD;

BEGIN
   RCMode := RCMode SHL 10;    { make mask for RC field in control word}
   ASM
      FSTCW    [CtrlWord]      { store NDP control word                }
      MOV      AX, [CtrlWord]  { load control word into CPU            }
      AND      AX, 0F3FFh      { mask out rounding control field       }
      OR       AX, [RCMode]    { set desired rounding mode in RC field }
      MOV      [CtrlWord], AX  { store new control word                }
      FLDCW    [CtrlWord]      { set new rounding control in NDP       }
   END;
END;

BEGIN
   FOR RoundingMode := 0 TO 3 DO BEGIN
      B := 1.2345678901234567890e100;
      SetRoundingMode (RoundingMode);
      FOR L := 1 TO 51 DO BEGIN
         B := Sqrt (B);
      END;
      FOR L := 1 TO 51 DO BEGIN
         B := -B*B;
      END;
      SetRoundingMode (0);     { round to nearest for printout }
      WriteLn (RoundingMode, B:28);
   END;
END.
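
The four settings of the RC field can also be illustrated without a
coprocessor. The following Python sketch is an illustration under stated
assumptions, not what the 80x87 does internally: it rounds exact rational
numbers to a given number of significant binary digits, and the function
name round_binary and the mode names are invented here. It shows why the
RCtrl program works with a negated value: round down and chop agree for
positive values but differ for negative ones.

```python
import math
from fractions import Fraction


def round_binary(value: Fraction, bits: int, mode: str) -> Fraction:
    """Round 'value' to 'bits' significant binary digits.

    mode: 'nearest' (ties to even), 'down' (towards negative infinity),
          'up' (towards positive infinity) or 'chop' (towards zero).
    """
    if value == 0:
        return Fraction(0)
    # Scale so that 2**(bits-1) <= |scaled| < 2**bits, then round the
    # scaled value to an integer according to the selected mode.
    e = (abs(value.numerator).bit_length()
         - value.denominator.bit_length() - bits)
    while abs(value) / Fraction(2) ** e >= 2 ** bits:
        e += 1
    while abs(value) / Fraction(2) ** e < 2 ** (bits - 1):
        e -= 1
    scaled = value / Fraction(2) ** e
    rounders = {"nearest": round, "down": math.floor,
                "up": math.ceil, "chop": math.trunc}
    return rounders[mode](scaled) * Fraction(2) ** e


if __name__ == "__main__":
    for mode in ("nearest", "down", "up", "chop"):
        print(mode,
              round_binary(Fraction(1, 3), 4, mode),
              round_binary(Fraction(-1, 3), 4, mode))
```

With four significant bits, 1/3 lies between 5/16 and 11/32, so the
sketch returns one of these two neighbours depending on the mode and on
the sign of the argument.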
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM DenormTs;

VAR E: EXTENDED;
    D: DOUBLE;
    S: SINGLE;

BEGIN
   WriteLn ('Testing support and printing of denormals');
   WriteLn;
   Write ('Coprocessor is: ');
   CASE Test8087 OF
      0: WriteLn ('Emulator');
      1: WriteLn ('8087 or compatible');
      2: WriteLn ('80287 or compatible');
      3: WriteLn ('80387 or compatible');
   END;
   WriteLn;
   S := 1.18e-38;
   S := S * 3.90625e-3;
   IF S = 0 THEN
      WriteLn ('SINGLE denormals not supported')
   ELSE BEGIN
      WriteLn ('SINGLE denormals supported');
      WriteLn ('SINGLE denormal prints as: ', S);
      WriteLn ('Denormal should be printed as 4.60943...E-0041');
   END;
   WriteLn;
   D := 2.24e-308;
   D := D * 3.90625e-3;
   IF D = 0 THEN
      WriteLn ('DOUBLE denormals not supported')
   ELSE BEGIN
      WriteLn ('DOUBLE denormals supported');
      WriteLn ('DOUBLE denormal prints as: ', D);
      WriteLn ('Denormal should be printed as 8.75...E-0311');
   END;
   WriteLn;
   E := 3.37e-4932;
   E := E * 3.90625e-3;
   IF E = 0 THEN
      WriteLn ('EXTENDED denormals not supported')
   ELSE BEGIN
      WriteLn ('EXTENDED denormals supported');
      WriteLn ('EXTENDED denormal prints as: ', E);
      WriteLn ('Denormal should be printed as 1.3164...E-4934');
   END;
END.
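
The SINGLE and DOUBLE parts of DenormTs can be cross-checked on any
machine with IEEE arithmetic. The sketch below is written in Python and
is not a replacement for the test above: Python has no 80-bit EXTENDED
type, so that case is left out, and the single-precision value is
obtained by rounding a double with the struct module. The names
to_single, single_exponent_field and make_denormals are made up for this
illustration. The multiplier 3.90625e-3 is exactly 2**-8, so it only
shifts the exponent below the normalized range.

```python
import struct
import sys

SINGLE_MIN_NORMAL = 2.0 ** -126           # smallest normalized single
DOUBLE_MIN_NORMAL = sys.float_info.min    # smallest normalized double


def to_single(x: float) -> float:
    """Round a double to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def single_exponent_field(x: float) -> int:
    """Biased 8-bit exponent of x stored as a single; 0 marks a denormal."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return (bits >> 23) & 0xFF


def make_denormals():
    """Repeat the SINGLE and DOUBLE steps of the test program."""
    s = to_single(to_single(1.18e-38) * 3.90625e-3)
    d = 2.24e-308 * 3.90625e-3
    return s, d


if __name__ == "__main__":
    s, d = make_denormals()
    print("single:", s, "exponent field:", single_exponent_field(s))
    print("double:", d)
```

A result of zero in either case would indicate that denormals are being
flushed to zero on the machine running the sketch.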
Appendix B

; FILE: APFELM4.ASM
; assemble with MASM /e APFELM4 or TASM /e APFELM4

CODE       SEGMENT BYTE PUBLIC 'CODE'

           ASSUME  CS: CODE

           PAGE    ,120

           PUBLIC  APPLE87;

APPLE87    PROC    NEAR
           PUSH    BP                    ; save caller's base pointer
           MOV     BP, SP                ; make new frame pointer
           PUSH    DS                    ; save caller's data segment
           PUSH    SI                    ; save register
           PUSH    DI                    ; variables
           LDS     BX, [BP+04]           ; pointer to parameter record
           FINIT                         ; init 80x87           FSP->R0
           FILD    WORD PTR  [BX+02]     ; maxrad               FSP->R7
           FLD     QWORD PTR [BX+08]     ; qmax                 FSP->R6
           FSUB    QWORD PTR [BX+16]     ; qmax-qmin            FSP->R6
           DEC     WORD PTR  [BX+04]     ; ymax-1
           FIDIV   WORD PTR  [BX+04]     ; (qmax-qmin)/(ymax-1) FSP->R6
           FSTP    QWORD PTR [BX+16]     ; save delta_q         FSP->R7
           FLD     QWORD PTR [BX+24]     ; pmax                 FSP->R6
           FSUB    QWORD PTR [BX+32]     ; pmax-pmin            FSP->R6
           DEC     WORD PTR  [BX+06]     ; xmax-1
           FIDIV   WORD PTR  [BX+06]     ; delta_p              FSP->R6
           MOV     AX, [BX]              ; save maxiter,[BX] needed for
           MOV     [BX+2], AX            ; 80x87 status now
           XOR     BP, BP                ; y=0
           FLD     QWORD PTR [BX+08]     ; qmax                 FSP->R5
           CMP     WORD PTR  [BX+40], 0  ; fast mode on 8087 desired ?
           JE      yloop                 ; no, normal mode
           FSTCW   [BX]                  ; save NDP control word
           AND     WORD PTR [BX], 0FCFFh ; set PCTRL = single precision
           FLDCW   [BX]                  ; get back NDP control word
yloop:     XOR     DI, DI                ; x=0
           FLD     QWORD PTR [BX+32]     ; pmin                 FSP->R4
xloop:     FLDZ                          ; j**2= 0              FSP->R3
           FLDZ                          ; 2ij = 0              FSP->R2
           FLDZ                          ; i**2= 0              FSP->R1
           MOV     CX, [BX+2]            ; maxiter
           MOV     DL, 41h               ; mask for C0 and C3 cond.bits
iteration: FSUB    ST, ST(2)             ; i**2-j**2            FSP->R1
           FADD    ST, ST(3)             ; i**2-j**2+p = i      FSP->R1
           FLD     ST(0)                 ; duplicate i          FSP->R0
           FMUL    ST(1), ST             ; i**2                 FSP->R0
           FADD    ST, ST(0)             ; 2i                   FSP->R0
           FXCH    ST(2)                 ; 2*i*j                FSP->R0
           FADD    ST, ST(5)             ; 2*i*j+q = j          FSP->R0
           FMUL    ST(2), ST             ; 2*i*j                FSP->R0
           FMUL    ST, ST(0)             ; j**2                 FSP->R0
           FST     ST(3)                 ; save j**2            FSP->R0
           FADD    ST, ST(1)             ; i**2+j**2            FSP->R0
           FCOMP   ST(7)                 ; i**2+j**2 > maxrad?  FSP->R1
           FSTSW   [BX]                  ; save 80x87 cond.code FSP->R1
           TEST    BYTE PTR [BX+1], DL   ; test carry and zero flags
           LOOPNZ  iteration             ; until maxiter if not diverg.
           MOV     DX, CX                ; number of loops executed
           NEG     CX                    ; carry set if CX <> 0
           ADC     DX, 0                 ; adjust DX if no. of loops<>0

           ; plot point here (DI = X, BP = y, DX has the color)

           FSTP    ST(0)                 ; pop i**2             FSP->R2
           FSTP    ST(0)                 ; pop 2ij              FSP->R3
           FSTP    ST(0)                 ; pop j**2             FSP->R4
           FADD    ST,ST(2)              ; p=p+delta_p          FSP->R4
           INC     DI                    ; x:=x+1
           CMP     DI, [BX+6]            ; x > xmax ?
           JBE     xloop                 ; no, continue on same line
           FSTP    ST(0)                 ; pop p                FSP->R5
           FSUB    QWORD PTR [BX+16]     ; q=q-delta_q          FSP->R5
           INC     BP                    ; y:=y+1
           CMP     BP, [BX+4]            ; y > ymax ?
           JBE     yloop                 ; no, picture not done yet
groesser:  POP     DI                    ; restore
           POP     SI                    ; register variables
           POP     DS                    ; restore caller's data segm.
           POP     BP                    ; restore caller's base pointer
           RET     4                     ; pop parameters and return
APPLE87    ENDP

CODE       ENDS

           END

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

UNIT Time;

INTERFACE

FUNCTION Clock: LONGINT;         { same as VMS; time in milliseconds }

IMPLEMENTATION

FUNCTION Clock: LONGINT; ASSEMBLER;
ASM
           PUSH    DS            { save caller's data segment       }
           XOR     DX, DX        { initialize data segment to       }
           MOV     DS, DX        { access ticker counter            }
           MOV     BX, 46Ch      { offset of ticker counter in segm.}
           MOV     DX, 43h       { timer chip control port          }
           MOV     AL, 4         { freeze timer 0                   }
           PUSHF                 { save caller's int flag setting   }
           STI                   { allow update of ticker counter   }
           LES     DI, DS:[BX]   { read BIOS ticker counter         }
           OUT     DX, AL        { latch timer 0                    }
           LDS     SI, DS:[BX]   { read BIOS ticker counter         }
           IN      AL, 40h       { read latched timer 0 lo-byte     }
           MOV     AH, AL        { save lo-byte                     }
           IN      AL, 40h       { read latched timer 0 hi-byte     }
           POPF                  { restore caller's int flag        }
           XCHG    AL, AH        { correct order of hi and lo       }
           MOV     CX, ES        { ticker counter 1 in CX:DI:AX     }
           CMP     DI, SI        { ticker counter updated ?         }
           JE      @no_update    { no                               }
           OR      AX, AX        { update before timer freeze ?     }
           JNS     @no_update    { no                               }
           MOV     DI, SI        { use second                       }
           MOV     CX, DS        { ticker counter                   }
@no_update:NOT     AX            { counter counts down              }
           MOV     BX, 36EDh     { load multiplier                  }
           MUL     BX            { W1 * M                           }
           MOV     SI, DX        { save W1 * M (hi)                 }
           MOV     AX, BX        { get M                            }
           MUL     DI            { W2 * M                           }
           XCHG    BX, AX        { AX = M, BX = W2 * M (lo)         }
           MOV     DI, DX        { DI = W2 * M (hi)                 }
           ADD     BX, SI        { accumulate                       }
           ADC     DI, 0         { result                           }
           XOR     SI, SI        { load zero                        }
           MUL     CX            { W3 * M                           }
           ADD     AX, DI        { accumulate                       }
           ADC     DX, SI        { result in DX:AX:BX               }
           MOV     DH, DL        { move result                      }
           MOV     DL, AH        { from DL:AX:BX                    }
           MOV     AH, AL        { to                               }
           MOV     AL, BH        { DX:AX:BH                         }
           MOV     DI, DX        { save result                      }
           MOV     CX, AX        { in DI:CX                         }
           MOV     AX, 25110     { calculate correction             }
           MUL     DX            { factor                           }
           SUB     CX, DX        { subtract correction              }
           SBB     DI, SI        { factor                           }
           XCHG    AX, CX        { result back                      }
           MOV     DX, DI        { to DX:AX                         }
           POP     DS            { restore caller's data segment    }
END;

BEGIN
   Port [$43] := $34;    { need rate generator, not square wave}
   Port [$40] := 0;      { generator as prog. by some BIOSes   }
   Port [$40] := 0;      { for timer 0                         }
END. { Time }

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$A+,B-,R-,I-,V-,N+,E+}

PROGRAM PeakFlop;

USES Time;

TYPE ParamRec = RECORD
                   MaxIter, MaxRad, YMax, XMax: WORD;
                   Qmax, Qmin, Pmax, Pmin: DOUBLE;
                   FastMod: WORD;
                   PlotFkt: POINTER;
                   FLOPS:LONGINT;
                END;

VAR Param: ParamRec;
    Start: LONGINT;

{$L APFELM4.OBJ}

PROCEDURE Apple87 (VAR Param: ParamRec); EXTERNAL;

BEGIN
   WITH Param DO BEGIN
      MaxIter:= 50;
      MaxRad := 30;
      YMax   := 30;
      XMax   := 30;
      Pmin   :=-2.1;
      Pmax   := 1.1;
      Qmin   :=-1.2;
      Qmax   := 1.2;
      FastMod:= Word (FALSE);
      PlotFkt:= NIL;
      Flops  := 0;
   END;
   Start := Clock;
   Apple87 (Param);          { executes 104002 FLOP }
   Start := Clock - Start;   { elapsed time in milliseconds }
   WriteLn ('Peak-MFLOPS: ', 104.002 / Start);
END.
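
The iteration that APPLE87 performs on the coprocessor stack, and the
arithmetic behind the last line of PeakFlop, can be written down
compactly. The sketch below is in Python and is meant only as a reading
aid, assuming the comments in the assembly listing describe the algorithm
correctly; the function names escape_iterations and peak_mflops are made
up for this illustration, and the sketch makes no attempt to reproduce
the timing or the FLOP count of the original.

```python
def escape_iterations(p: float, q: float, max_iter: int = 50,
                      max_rad: float = 30.0) -> int:
    """Iterate i <- i*i - j*j + p, j <- 2*i*j + q from i = j = 0.

    Returns the number of iterations performed before i*i + j*j exceeds
    max_rad, or max_iter if that never happens.
    """
    i = j = 0.0
    for n in range(1, max_iter + 1):
        i, j = i * i - j * j + p, 2.0 * i * j + q
        if i * i + j * j > max_rad:
            return n
    return max_iter


def peak_mflops(flop_count: int, elapsed_ms: float) -> float:
    """FLOP / (milliseconds * 1000) = millions of FLOP per second."""
    return flop_count / (elapsed_ms * 1000.0)


if __name__ == "__main__":
    print(escape_iterations(0.0, 0.0), escape_iterations(10.0, 0.0))
    # 104002 FLOP in a hypothetical 50 ms, same as 104.002 / 50
    print(peak_mflops(104002, 50))
```

Dividing 104.002 by the elapsed time in milliseconds, as PeakFlop does,
is the same computation as peak_mflops with the FLOP count 104002.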
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

; FILE: M4X4.ASM
;
; assemble with TASM /e M4X4 or MASM /e M4X4

CODE       SEGMENT BYTE PUBLIC 'CODE'

           ASSUME  CS:CODE

           PUBLIC  MUL_4x4
           PUBLIC  IIT_MUL_4x4

FSBP0      EQU     DB 0DBh, 0E8h        ; declare special IIT
FSBP1      EQU     DB 0DBh, 0EBh        ; instructions
FSBP2      EQU     DB 0DBh, 0EAh
F4X4       EQU     DB 0DBh, 0F1h

;---------------------------------------------------------------------
;
; MUL_4x4 multiplies a four-by-four matrix by an array of
; four-dimensional vectors. This operation is needed for 3D
; transformations in graphics data processing. There are arrays for
; each component of a vector. Thus there is an array containing all
; the x components, another containing all the y components and so on.
; Each component is an 8 byte IEEE floating point number. Two indices
; into the array of vectors are given. The first is the index of the
; vector that will be processed first, the second is the index of the
; vector processed last.
;
;---------------------------------------------------------------------

MUL_4x4    PROC    NEAR

AddrX      EQU     DWORD PTR [BP+24]    ; address of X component array
AddrY      EQU     DWORD PTR [BP+20]    ; address of Y component array
AddrZ      EQU     DWORD PTR [BP+16]    ; address of Z component array
AddrW      EQU     DWORD PTR [BP+12]    ; address of W component array
AddrT      EQU     DWORD PTR [BP+8]     ; addr. of 4x4 transform. mat.
F          EQU     WORD PTR  [BP+6]     ; first vector to process
K          EQU     WORD PTR  [BP+4]     ; last vector to process
RetAddr    EQU     WORD PTR  [BP+2]     ; return address saved by call
SavdBP     EQU     WORD PTR  [BP+0]     ; saved frame pointer
SavdDS     EQU     WORD PTR  [BP-2]     ; caller's data segment

           PUSH    BP                   ; save TURBO-Pascal frame pointer
           MOV     BP, SP               ; new frame pointer
           PUSH    DS                   ; save TURBO-Pascal data segment

           MOV     CX, K                ; final index
           SUB     CX, F                ; final index - start index
           JNC     $ok                  ; must not
           JMP     $nothing             ; be negative
$ok:       INC     CX                   ; number of elements

           MOV     SI, F                ; init offset into arrays
           SHL     SI, 1                ; each
           SHL     SI, 1                ; element
           SHL     SI, 1                ; has 8 bytes

           LDS     DI, AddrT            ; addr. of transformation mat.
           FLD     QWORD PTR [DI]       ; load a[0,0]            = R7
           FLD     QWORD PTR [DI+8]     ; load a[0,1]            = R6

$mat_mul:  LES     BX, AddrX            ; addr. of x component array
           FLD     QWORD PTR ES:[BX+SI] ; load x[a]              = R5
           LES     BX, AddrY            ; addr. of y component array
           FLD     QWORD PTR ES:[BX+SI] ; load y[a]              = R4
           LES     BX, AddrZ            ; addr. of z component array
           FLD     QWORD PTR ES:[BX+SI] ; load z[a]              = R3
           LES     BX, AddrW            ; addr. of w component array
           FLD     QWORD PTR ES:[BX+SI] ; load w[a]              = R2

           FLD     ST(5)                ; load a[0,0]            = R1
           FMUL    ST, ST(4)            ; a[0,0] * x[a]          = R1
           FLD     ST(5)                ; load a[0,1]            = R0
           FMUL    ST, ST(4)            ; a[0,1] * y[a]          = R0
           FADDP   ST(1), ST            ; a[0,0]*x[a]+a[0,1]*y[a]=R1
           FLD     QWORD PTR [DI+16]    ; load a[0,2]            = R0
           FMUL    ST, ST(3)            ; a[0,2] * z[a]          = R0
           FADDP   ST(1), ST            ; a[0,0]*x[a]...a[0,2]*z[a]=R1
           FLD     QWORD PTR [DI+24]    ; load a[0,3]            = R0
           FMUL    ST, ST(2)            ; a[0,3] * w[a]          = R0
           FADDP   ST(1), ST            ; a[0,0]*x[a]...a[0,3]*w[a]=R1
           LES     BX, AddrX            ; get address of x vector
           FSTP    QWORD PTR ES:[BX+SI] ; write new x[a]

           FLD     QWORD PTR [DI+32]    ; load a[1,0]            = R1
           FMUL    ST, ST(4)            ; a[1,0] * x[a]          = R1
           FLD     QWORD PTR [DI+40]    ; load a[1,1]            = R0
           FMUL    ST, ST(4)            ; a[1,1] * y[a]          = R0
           FADDP   ST(1), ST            ; a[1,0]*x[a]+a[1,1]*y[a]=R1
           FLD     QWORD PTR [DI+48]    ; load a[1,2]            = R0
           FMUL    ST, ST(3)            ; a[1,2] * z[a]          = R0
           FADDP   ST(1), ST            ; a[1,0]*x[a]...a[1,2]*z[a]=R1
           FLD     QWORD PTR [DI+56]    ; load a[1,3]            = R0
           FMUL    ST, ST(2)            ; a[1,3] * w[a]          = R0
           FADDP   ST(1), ST            ; a[1,0]*x[a]...a[1,3]*w[a]=R1
           LES     BX, AddrY            ; get address of y vector
           FSTP    QWORD PTR ES:[BX+SI] ; write new y[a]

           FLD     QWORD PTR [DI+64]    ; load a[2,0]            = R1
           FMUL    ST, ST(4)            ; a[2,0] * x[a]          = R1
           FLD     QWORD PTR [DI+72]    ; load a[2,1]            = R0
           FMUL    ST, ST(4)            ; a[2,1] * y[a]          = R0
           FADDP   ST(1), ST            ; a[2,0]*x[a]+a[2,1]*y[a]=R1
           FLD     QWORD PTR [DI+80]    ; load a[2,2]            = R0
           FMUL    ST, ST(3)            ; a[2,2] * z[a]          = R0
           FADDP   ST(1), ST            ; a[2,0]*x[a]...a[2,2]*z[a]=R1
           FLD     QWORD PTR [DI+88]    ; load a[2,3]            = R0
           FMUL    ST, ST(2)            ; a[2,3] * w[a]          = R0
           FADDP   ST(1), ST            ; a[2,0]*x[a]...a[2,3]*w[a]=R1
           LES     BX, AddrZ            ; get address of z vector
           FSTP    QWORD PTR ES:[BX+SI] ; write new z[a]

           FLD     QWORD PTR [DI+96]    ; load a[3,0]            = R1
           FMULP   ST(4), ST            ; a[3,0] * x[a]          = R5
           FLD     QWORD PTR [DI+104]   ; load a[3,1]            = R1
           FMULP   ST(3), ST            ; a[3,1] * y[a]          = R4
           FLD     QWORD PTR [DI+112]   ; load a[3,2]            = R1
           FMULP   ST(2), ST            ; a[3,2] * z[a]          = R3
           FLD     QWORD PTR [DI+120]   ; load a[3,3]            = R1
           FMULP   ST(1), ST            ; a[3,3] * w[a]          = R2
           FADDP   ST(1), ST            ; a[3,3]*w[a]+a[3,2]*z[a]=R3
           FADDP   ST(1), ST            ; a[3,3]*w[a]...a[3,1]*y[a]=R4
           FADDP   ST(1), ST            ; a[3,3]*w[a]...a[3,0]*x[a]=R5
           LES     BX, AddrW            ; get address of w vector
           FSTP    QWORD PTR ES:[BX+SI] ; write new w[a]

           ADD     SI, 8                ; new offset into arrays
           DEC     CX                   ; decrement element counter
           JZ      $done                ; no elements left, done
           JMP     $mat_mul             ; transform next vector
$done:     FSTP    ST(0)                ; clear
           FSTP    ST(0)                ; FPU stack
$nothing:  POP     DS                   ; restore TP data segment
           POP     BP                   ; restore TP frame pointer
           RET     24                   ; pop parameters and return
MUL_4X4    ENDP

;---------------------------------------------------------------------
;
; IIT_MUL_4x4 multiplies a four-by-four matrix by an array of
; four-dimensional vectors. This operation is needed for 3D
; transformations in graphics data processing. There are arrays for
; each component of a vector. Thus there is an array containing all
; the x components, another containing all the y components and so on.
; Each component is an 8 byte IEEE floating point number. Two indices
; into the array of vectors are given. The first is the index of the
; vector that will be processed first, the second is the index of the
; vector processed last. This subroutine uses the special instructions
; only available on IIT coprocessors to provide fast matrix multiply
; capabilities. So make sure to use it only on IIT coprocessors.
;
;---------------------------------------------------------------------

IIT_MUL_4x4 PROC   NEAR

AddrX      EQU     DWORD PTR [BP+24]    ; address of X component array
AddrY      EQU     DWORD PTR [BP+20]    ; address of Y component array
AddrZ      EQU     DWORD PTR [BP+16]    ; address of Z component array
AddrW      EQU     DWORD PTR [BP+12]    ; address of W component array
AddrT      EQU     DWORD PTR [BP+8]     ; addr. of 4x4 transf. matrix
F          EQU     WORD PTR  [BP+6]     ; first vector to process
K          EQU     WORD PTR  [BP+4]     ; last vector to process
RetAddr    EQU     WORD PTR  [BP+2]     ; return address saved by call
SavdBP     EQU     WORD PTR  [BP+0]     ; saved frame pointer
SavdDS     EQU     WORD PTR  [BP-2]     ; caller's data segment
Ctrl87     EQU     WORD PTR  [BP-4]     ; caller's 80x87 control word

           PUSH    BP                   ; save TURBO-Pascal frame ptr
           MOV     BP, SP               ; new frame pointer
           PUSH    DS                   ; save TURBO-Pascal data seg.
           SUB     SP, 2                ; make local variable
           FSTCW   [Ctrl87]             ; save 80x87 ctrl word
           LES     SI, AddrT            ; ptr to transformation matrix
           FINIT                        ; initialize coprocessor
           FSBP2                        ; set register bank 2
           FLD     QWORD PTR ES:[SI]    ; load a[0,0]
           FLD     QWORD PTR ES:[SI+32] ; load a[1,0]
           FLD     QWORD PTR ES:[SI+64] ; load a[2,0]
           FLD     QWORD PTR ES:[SI+96] ; load a[3,0]
           FLD     QWORD PTR ES:[SI+8]  ; load a[0,1]
           FLD     QWORD PTR ES:[SI+40] ; load a[1,1]
           FLD     QWORD PTR ES:[SI+72] ; load a[2,1]
           FLD     QWORD PTR ES:[SI+104]; load a[3,1]
           FINIT                        ; initialize coprocessor
           FSBP1                        ; set register bank 1
           FLD     QWORD PTR ES:[SI+16] ; load a[0,2]
           FLD     QWORD PTR ES:[SI+48] ; load a[1,2]
           FLD     QWORD PTR ES:[SI+80] ; load a[2,2]
           FLD     QWORD PTR ES:[SI+112]; load a[3,2]
           FLD     QWORD PTR ES:[SI+24] ; load a[0,3]
           FLD     QWORD PTR ES:[SI+56] ; load a[1,3]
           FLD     QWORD PTR ES:[SI+88] ; load a[2,3]
           FLD     QWORD PTR ES:[SI+120]; load a[3,3]
                                        ; transformation matrix loaded

           MOV     AX, F                ; index of first vector
           MOV     DX, K                ; index of last vector
           MOV     BX, AX               ; index 1st vector to process
           MOV     CL, 3                ; component has 8 (2**3) bytes
           SHL     BX, CL               ; compute offset into arrays
           FINIT                        ; initialize coprocessor
           FSBP0                        ; set register bank 0

$mat_loop: LES     SI, AddrW            ; addr. of W component array
           FLD     QWORD PTR ES:[SI+BX] ; W component current vector
           LES     SI, AddrZ            ; addr. of Z component array
           FLD     QWORD PTR ES:[SI+BX] ; Z component current vector
           LES     SI, AddrY            ; addr. of Y component array
           FLD     QWORD PTR ES:[SI+BX] ; Y component current vector
           LES     SI, AddrX            ; addr. of X component array
           FLD     QWORD PTR ES:[SI+BX] ; X component current vector
           F4X4                         ; mul 4x4 matrix by 4x1 vector
           INC     AX                   ; next vector
           MOV     DI, AX               ; next vector
           SHL     DI, CL               ; offset of vector into arrays

           FSTP    QWORD PTR ES:[SI+BX] ; store X comp. of curr. vect.
           LES     SI, AddrY            ; address of Y component array
           FSTP    QWORD PTR ES:[SI+BX] ; store Y comp. of curr. vect.
           LES     SI, AddrZ            ; address of Z component array
           FSTP    QWORD PTR ES:[SI+BX] ; store Z comp. of curr. vect.
           LES     SI, AddrW            ; address of W component array
           FSTP    QWORD PTR ES:[SI+BX] ; store W comp. of curr. vect.

           MOV     BX, DI               ; ofs nxt vect. in comp. arrays
           CMP     AX, DX               ; nxt vector past upper bound?
           JLE     $mat_loop            ; no, transform next vector
           FLDCW   [Ctrl87]             ; restore orig 80x87 ctrl word
           ADD     SP, 2                ; get rid of local variable
           POP     DS                   ; restore TP data segment
           POP     BP                   ; restore TP frame pointer
           RET     24                   ; pop parameters and return
IIT_MUL_4x4 ENDP

CODE       ENDS

           END

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

{$N+,E+}

PROGRAM Trnsform;

USES Time;

CONST VectorLen = 8190;

TYPE  Vector    = ARRAY [0..VectorLen] OF DOUBLE;
      VectorPtr = ^Vector;
      Mat4      = ARRAY [1..4, 1..4] OF DOUBLE;

VAR   X: VectorPtr;
      Y: VectorPtr;
      Z: VectorPtr;
      W: VectorPtr;
      T: Mat4;
      K: INTEGER;
      L: INTEGER;
      First: INTEGER;
      Last:  INTEGER;
      Start: LONGINT;
      Elapsed:LONGINT;

PROCEDURE MUL_4X4     (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4;
                       First, Last: INTEGER); EXTERNAL;

PROCEDURE IIT_MUL_4X4 (X, Y, Z, W: VectorPtr;
                       VAR T: Mat4;
                       First, Last: INTEGER); EXTERNAL;

{$L M4X4.OBJ}

BEGIN
   WriteLn ('Test8087 = ', Test8087);
   New (X);
   New (Y);
   New (Z);
   New (W);
   FOR L := 1 TO VectorLen DO BEGIN
      X^ [L] := Random;
      Y^ [L] := Random;
      Z^ [L] := Random;
      W^ [L] := Random;
   END;
   X^ [0] := 1;
   Y^ [0] := 1;
   Z^ [0] := 1;
   W^ [0] := 1;
   FOR K := 1 TO 4 DO BEGIN
      FOR L := 1 TO 4 DO BEGIN
         T [K, L] := (K-1)*4 + L;
      END;
   END;
   First := 0;
   Last  := 8190;
   Start := Clock;
   MUL_4X4 (X, Y, Z, W, T, First, Last);
   { IIT_MUL_4X4 (X, Y, Z, W, T, First, Last); }
   Elapsed := Clock - Start;
   WriteLn ('Number of vectors: ', Last-First+1);
   WriteLn ('Time: ', Elapsed, ' ms');
   WriteLn ('Equivalent to ', (28.0*(Last-First+1)/1e6)/
            (Elapsed*1e-3):0:4, ' MFLOPS');
   WriteLn;
   WriteLn ('Last vector:');
   WriteLn;
   WriteLn (X^[Last]);
   WriteLn (Y^[Last]);
   WriteLn (Z^[Last]);
   WriteLn (W^[Last]);
END.
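
As a reading aid for MUL_4X4, the same operation on the same data layout
(one array per vector component) can be sketched in a few lines. The
sketch is written in Python and is not a port of the assembly routine:
the name mul_4x4, the use of Python lists and the row-major nested list
for the matrix are assumptions made for this illustration. The constant
28 is the FLOP count per vector used by Trnsform (16 multiplications and
12 additions).

```python
FLOP_PER_VECTOR = 28  # 16 multiplications + 12 additions, as in Trnsform


def mul_4x4(x, y, z, w, t, first, last):
    """Transform vectors first..last in place.

    x, y, z, w are the component arrays, t is a 4x4 matrix given as a
    list of four rows. Returns the FLOP count Trnsform would credit.
    """
    for a in range(first, last + 1):
        v = (x[a], y[a], z[a], w[a])   # keep the old components
        x[a], y[a], z[a], w[a] = (
            sum(t[row][col] * v[col] for col in range(4))
            for row in range(4)
        )
    return FLOP_PER_VECTOR * max(0, last - first + 1)


if __name__ == "__main__":
    # Same matrix as in Trnsform: T[K, L] = (K-1)*4 + L
    t = [[row * 4 + col + 1 for col in range(4)] for row in range(4)]
    x, y, z, w = [1.0], [1.0], [1.0], [1.0]
    flops = mul_4x4(x, y, z, w, t, 0, 0)
    print(x[0], y[0], z[0], w[0], flops)
```

For the vector (1, 1, 1, 1), which Trnsform stores at index 0, each new
component is simply the sum of one row of the matrix.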