Text File | 1996-08-05 | 254.2 KB | 4,670 lines |
-
-
- EVERYTHING YOU ALWAYS WANTED TO KNOW ABOUT MATH COPROCESSORS
-
- This document has been created to provide the net.community with some
- detailed information about mathematical coprocessors for the Intel 80x86 CPU
- family. It may also help to answer some of the FAQs (frequently asked
- questions) about this topic. The primary focus of this document is on 80387-
- compatible chips, but there is also some information on the other chips in
- the 80x87 family and the Weitek family of coprocessors. Care was taken to
- make the information included as accurate as possible. If you think you have
- discovered erroneous information in this text, or think that a certain detail
- needs to be clarified, or want to suggest additions, feel free to contact me
- at:
-
- juffa@ira.uka.de
-
- or at my SnailMail address:
-
- Norbert Juffa
- Wielandtstr. 14
- 76137 Karlsruhe
- Germany
-
-
- This is the sixth version of this document (dated 01-Oct-94) and I'd like
- to thank those who have helped improve it by commenting on the previous
- versions:
-
- Fred Dunlap (fred@cyrix.com), Peter Forsberg (peter@vnet.ibm.com),
- Richard Krehbiel (richk@grevyn.com), Arto Viitanen (av@cs.uta.fi),
- Jerry Whelan (guru@stasi.bradley.edu),
- Eric Johnson (johnson%camax01@uunet.UU.NET), Warren Ferguson
- (ferguson@seas.smu.edu), Bengt Ask (f89ba@efd.lth.se), Thomas Hoberg
- (tmh@prosun.first.gmd.de), Nhuan Doduc (ndoduc@framentec.fr), John
- Levine (johnl@iecc.cambridge.ma.us), David Hough (dgh@validgh.com),
- Duncan Murdoch (dmurdoch@mast.QueensU.CA), Benjamin Eitan
- (benny.iil.intel.com)
-
- A very special thanks goes to David Ruggiero (osiris@halcyon.halcyon.com),
- who did a great job editing and formatting this article. Thanks David!
-
-
- Contents of this document
- -------------------------
-
- 1) What are math coprocessors?
- 2) How PC programs use a math coprocessor
- 3) Which applications benefit from a math coprocessor
- 4) Potential performance gains with a math coprocessor
- 5) How various math coprocessors work
- 6) Coprocessor emulator software
- 7) Installing a math coprocessor
- 8) Detailed description and specifications for all available math
- coprocessor chips
- 9) Finding out which coprocessor you have (the COMPTEST program)
- 10) Current coprocessor prices and purchasing advice
- 11) The coprocessor benchmark programs (performance comparisons of
- available math coprocessors using various CPUs)
- 12) Clock-cycle timings for each coprocessor instruction
- 13) Accuracy tests and IEEE-754 conformance for various coprocessors
- 14) Accuracy of transcendental function calculations for various coprocessors
- 15) Compatibility tests with Intel's 387DX / the SMDIAG program
- 16) References (literature)
- 17) Addresses of manufacturers of math coprocessors
- 18) Appendix A: Test programs for partial compatibility and accuracy checks
- 19) Appendix B: Benchmark programs TRNSFORM and PEAKFLOP
-
-
-
- ===========================
- What are math coprocessors?
- ===========================
-
- A coprocessor in the traditional sense is a processor, separate from the main
- CPU, that extends the capabilities of a CPU in a transparent manner. This
- means that from the program's (and programmer's) point of view, the CPU and
- coprocessor together look like a single, unified machine.
-
- The 80x87 family of math coprocessors (also known as MCPs [Math
- CoProcessors], NDPs [Numerical Data Processors], NPXs [Numerical Processor
- eXtensions], or FPUs [Floating-Point Units], or simply "math chips") are
- typical examples of such coprocessors. The 80x86 CPUs, with the exception of
- the 80486 (which has a built-in FPU) can only handle 8, 16, or 32 bit
- integers as their basic data types. However, many PC-based applications
- require the use of not only integers, but floating-point numbers. Simply put,
- the use of floating-point numbers enables a binary representation of not only
- integers, but also fractional values over a wide range. A common application
- of floating-point numbers is in scientific applications, where very small
- (e.g., Planck's constant) and very large numbers (e.g., speed of light) must
- be accurately expressed. But floating-point numbers are also useful for
- business applications such as computing interest, and in the geometric
- calculations inherent in CAD/CAM processing.
-
- Because the instruction sets of all 80x86 CPUs directly support only integers
- and calculations upon integers, floating-point numbers and operations on them
- must be programmed indirectly as sequences of CPU integer instructions. As a
- result, floating-point computations are far slower than ordinary integer
- calculations. This is where the 80x87
- coprocessors come in: adding an 80x87 to an 80x86-based system augments the
- CPU architecture with eight floating-point registers, five additional data
- types and over 70 additional instructions, all designed to deal directly with
- floating-point numbers as a basic data type. This removes the 'penalty' for
- floating-point computations, and greatly increases overall system performance
- for applications which depend heavily on these calculations.
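-
- To give a feeling for what "programmed indirectly" means, here is a minimal
- sketch (in Python, used purely for illustration) of a single-precision
- multiplication carried out with integer operations only. The names soft_mul,
- float_to_bits and bits_to_float are hypothetical and not taken from any
- actual emulator. The sketch assumes normalized operands, truncates instead
- of rounding, and ignores zero, overflow, underflow, infinities and NaNs, all
- of which a real software floating-point package has to handle.
-
```python
import struct

def float_to_bits(x):
    """Return the 32-bit IEEE single-precision pattern of x as an integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def bits_to_float(bits):
    """Inverse of float_to_bits."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def soft_mul(a_bits, b_bits):
    """Multiply two normalized single-precision numbers with integer ops only."""
    sign = (a_bits ^ b_bits) & 0x80000000        # sign of product = XOR of signs
    exp_a = (a_bits >> 23) & 0xFF                # biased exponents
    exp_b = (b_bits >> 23) & 0xFF
    man_a = (a_bits & 0x7FFFFF) | 0x800000       # restore the implicit leading 1
    man_b = (b_bits & 0x7FFFFF) | 0x800000
    exponent = exp_a + exp_b - 127               # add exponents, remove one bias
    product = man_a * man_b                      # 48-bit product, 1 <= p < 4
    if product & (1 << 47):                      # product >= 2: renormalize
        product >>= 24
        exponent += 1
    else:
        product >>= 23
    return sign | (exponent << 23) | (product & 0x7FFFFF)

print(bits_to_float(soft_mul(float_to_bits(1.5), float_to_bits(2.5))))
```
-
- Even this stripped-down version needs a dozen or so integer operations for
- a single multiplication; a coprocessor does the same work, including all the
- special cases, with one instruction.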
-
- In addition to being able to quickly execute load/store operations on
- floating-point numbers, the 80x87 coprocessors can directly perform all the
- basic arithmetic operations on them. Besides "knowing" how to add, subtract,
- multiply and divide floating-point numbers, they can also operate on them to
- perform comparisons, square roots, transcendental functions (such as logarithms
- and sine/cosine/tangent), and compute their absolute value and remainder.
-
- Like most things in life, floating-point arithmetic has been standardized.
- The relevant standard (to which I will refer quite often in this document) is
- the "IEEE-754 Standard for Binary Floating-Point Arithmetic" [10,11]. The
- standard specifies numeric formats, value sets and how the basic arithmetic
- (+,-,*,/,sqrt, remainder) has to work. All the coprocessors covered in this
- document claim full or at least partial compliance with the IEEE-754
- standard.
-
-
-
- =================================================
- How PC programs use 80x87 and Weitek coprocessors
- =================================================
-
- The basic data type used by all 80x87 coprocessors is an 80-bit long
- floating-point number. This data type (called "temporary real" or "double
- extended precision") can directly represent numbers which range in size
- between 3.36*10^-4932 and 1.19*10^4932 (3.65*10^-4951 to 1.19*10^4932
- including denormal numbers) where '^' denotes the power operator. (For those
- familiar with floating-point formats, this format has 64 mantissa bits, 15
- exponent bits and 1 sign bit, for a total of 80 bits.) This format provides
- a precision of about 19 decimal places. 80x87s can also handle additional
- data types that are converted to or from the internal format when they are
- loaded into or stored from the coprocessor. These include 16-bit, 32-bit,
- and 64-bit integers as well as a BCD (binary coded decimal) data type that
- occupies 10 bytes and provides 18 decimal digits.
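-
- The range limits quoted above follow directly from the field widths of the
- 80-bit format. The following sketch (Python, for illustration only; the
- function name extended_range is hypothetical) recomputes them with the
- standard decimal module, assuming the usual IEEE-style exponent bias of
- 2^14-1 and taking into account that one of the 64 mantissa bits is the
- explicit integer bit, which leaves 63 fraction bits.
-
```python
from decimal import Decimal

def extended_range():
    """Return (smallest normal, smallest denormal, largest) 80-bit values."""
    exponent_bits = 15
    mantissa_bits = 64                    # includes the explicit integer bit
    fraction_bits = mantissa_bits - 1
    bias = 2 ** (exponent_bits - 1) - 1   # 16383
    two = Decimal(2)
    smallest_normal = two ** (1 - bias)                    # 2^-16382
    smallest_denormal = two ** (1 - bias - fraction_bits)  # 2^-16445
    largest = (2 - two ** -fraction_bits) * two ** bias    # just below 2^16384
    return smallest_normal, smallest_denormal, largest

for value in extended_range():
    print(format(value, ".2E"))
```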
-
- The 80x87 also supports two additional floating-point types. The short real
- data type (also called "single-precision") has 32 bits that split into 23
- mantissa bits, 8 exponent bits and a sign bit. By using the "hidden bit"
- technique, the effective length of the mantissa is increased to 24 bits. (The
- hidden bit technique exploits the fact that for normalized floating-point
- numbers, the mantissa m always is in the range 1 <= m < 2. Since the first
- mantissa bit represents the integer part of the mantissa, it is always set
- for normalized numbers, and therefore need not be stored, as it is guaranteed
- to always be 1.) The IEEE single-precision format provides a precision of
- about 6-7 decimal places and can represent numbers between 1.17*10^-38 and
- 3.40*10^38 (1.40*10^-45 to 3.40*10^38 including denormal numbers). The long
- real, or double-precision, data type has 64 bits, consisting of 52 mantissa
- bits, 11 exponent bits, and the sign bit. It provides 15-16 decimal digits of
- precision and can handle numbers from 2.22*10^-308 to 1.79*10^308
- (4.94*10^-324 to 1.79*10^308 including denormal numbers). (This format also
- uses the hidden bit technique to provide effectively 53 mantissa bits.)
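-
- The field layout and the hidden bit are easy to inspect on any machine that
- uses IEEE formats. The sketch below (Python, for illustration only; the
- function names are hypothetical) splits a single-precision number into its
- sign, exponent and stored mantissa bits, and then rebuilds the value by
- putting the hidden bit back. It assumes a normalized number and the
- single-precision exponent bias of 127.
-
```python
import struct

def decode_single(x):
    """Split x, stored as IEEE single precision, into (sign, exponent, fraction)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31                 # 1 sign bit
    exponent = (bits >> 23) & 0xFF    # 8 exponent bits (biased by 127)
    fraction = bits & 0x7FFFFF        # 23 stored mantissa bits
    return sign, exponent, fraction

def value_of_normalized(sign, exponent, fraction):
    """Rebuild the value of a normalized number from its three fields."""
    mantissa = (fraction | 0x800000) / 2 ** 23   # hidden bit restored: 1 <= m < 2
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)

fields = decode_single(-6.5)
print(fields, value_of_normalized(*fields))
```
-
- For -6.5 (which is -1.625 * 2^2) the stored mantissa bits encode only the
- fractional part .625; the leading 1 never appears in memory.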
-
- The eight registers in the 80x87 are organized in a stack-like manner which
- takes some getting used to if one programs the coprocessor directly in
- assembly language. However, nowadays the compilers or interpreters for most
- high level languages (HLLs) give a programmer easy access to the
- coprocessor's data types and instructions, so there is not much
- need to deal directly with the rather unusual architecture of the 80x87.
-
-
- The architecture of the Weitek chips differs significantly from the 80x87.
- Strictly speaking, the Weitek Abacus 3167 and 4167 are not coprocessors in
- that they do not transparently extend the CPU architecture; rather, they
- could be described as highly-specialized, memory-mapped IO devices. But as
- the term "coprocessor" has been traditionally used for these chips, they will
- be referred to as such here.
-
- The Weitek coprocessors have a RISC-like architecture which has been tuned
- for maximum performance. Only a small instruction set has been implemented in
- the chip, but each instruction executes at a very high speed (usually only a
- few clock cycles each). Instructions available include load/store, add,
- subtract, subtract reverse, multiply, multiply and negate, multiply and
- accumulate, multiply and take absolute value, divide reverse, negate,
- absolute value, compare/test, convert fix/float, and square root. In contrast
- to the 80x87 family, the Weitek Abacus does not support a double extended
- format, has no built-in transcendental functions, and does not support
- denormals. The resources required to implement such features have instead
- been devoted to implement the basic arithmetic operations as fast as
- possible.
-
- While the 80x87 coprocessors perform all internal calculations in double
- extended precision and therefore have about the same performance for single
- and double-precision calculations, the Weitek features explicit single and
- double-precision operations. For applications that require only single-
- precision operations, the Weitek can therefore provide very high performance,
- as single-precision operations are about twice as fast as their double-
- precision counterparts. Also, since the Weitek Abacus has more registers than
- the 80x87 coprocessors (31 versus 8), values can be kept in registers more
- often and have to be loaded from memory less frequently. This also leads to
- performance gains.
-
- The Weitek's register file consists of 31 32-bit registers, each one capable
- of holding an IEEE single-precision number. Pairs of consecutive single-
- precision registers can also be used as 64-bit IEEE double-precision
- registers; thus there are 15 double-precision registers. The Weitek register
- file has a conventional organization, like the register file of the 80386,
- not the special stack-like organization of the 80x87 coprocessors.
-
- To the main CPU, the Weitek Abacus appears as a 64 KB block of memory
- starting at physical address 0C0000000h. Each address in this range
- corresponds to a coprocessor instruction. Accessing a specified memory
- location within this block with a MOV instruction causes the corresponding
- Weitek instruction to be executed. (The instructions have been cleverly
- assigned to memory locations in such a way that loads to consecutive
- coprocessor registers can make use of the 386/486 MOVS string instruction.)
- This memory-mapped interface is much faster than the IO-oriented protocol
- that is used to couple the CPU to an 80287 or 80387 coprocessor. The Weitek's
- memory block can actually be assigned to any logical address using the MMU
- (memory management unit) in the 386/486's protected and virtual modes. This
- also means that the Weitek Abacus *cannot* be used in the real mode of those
- processors, since its physical starting address (0C0000000h) is not within
- the 1 MByte address range and the MMU is inoperable in real mode. However,
- DOS programs can make use of the Weitek by using a DOS extender or a memory
- manager (such as QEMM or EMM386) that runs in protected/virtual mode itself
- and can therefore map the Weitek's memory block to any desired location in
- the 1 MByte address range.
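-
- The real-mode restriction can be checked with a little address arithmetic.
- In real mode a physical address is formed as segment * 16 + offset, with
- both values limited to 16 bits. The sketch below (Python, for illustration
- only; the constant and function names are hypothetical) shows that even the
- highest address that can be formed this way, which lies just above the
- 1 MByte mark, is nowhere near the Weitek's memory block.
-
```python
WEITEK_BASE = 0xC0000000      # physical start address given in the text
WEITEK_SIZE = 64 * 1024       # 64 KB block

def real_mode_address(segment, offset):
    """Physical address formed from a 16-bit segment and a 16-bit offset."""
    return (segment << 4) + offset

highest_real_mode = real_mode_address(0xFFFF, 0xFFFF)
print(hex(highest_real_mode))                  # highest reachable address
print(hex(WEITEK_BASE + WEITEK_SIZE - 1))      # last byte of the Weitek block
print(highest_real_mode < WEITEK_BASE)         # block is out of reach
```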
-
- Typically the FS segment register is then set up to point to the Weitek's
- memory block. On the 80486, this technique has severe drawbacks, as using the
- FS: prefix takes an additional clock cycle, thereby nearly halving the
- performance of the 4167. Most DOS-based compilers exhibit this problem, so
- the only way around it is to code in assembly language [75]. The Weitek
- Abacus 3167 and 4167 are also supported by the UNIX operating system [33].
-
-
-
- ==========================================================
- Which application programs benefit from a math coprocessor
- ==========================================================
-
- According to the Intel 387DX User's Guide, there are more than 2100
- commercial programs that can make use of a 387-compatible coprocessor. Every
- program that uses floating-point arithmetic somewhere and contains the
- instructions to support an 80x87 or Weitek chip can gain speed by installing
- one. However, the speedup will vary from program to program (and even within
- the same program) depending on how computation-intensive the program or
- operation within the program is. Typical applications that benefit from the
- use of a math coprocessor are:
-
- - CAD programs (AutoCAD, VersaCAD, GenericCAD)
- - Spreadsheet programs (Lotus 1-2-3, Excel, Quattro, Wingz)
- - Business graphics programs (Arts&Letters, Freedom of Press, Freelance)
- - Mathematical analysis and statistical programs (Mathematica, TKSolver,
- SPSS/PC, Statgraphics)
- - Database programs (dBase IV, FoxBase, Paradox, Revelation)
-
- Note that for spreadsheets and databases, a coprocessor only helps if some
- kind of floating-point computation is performed; this is true more often for
- spreadsheets than for databases. Also note that the speed of many programs
- depends quite heavily on factors such as the speed of the graphics adapter
- (CAD) or the disk performance (databases), so the computational performance
- is only a (small) part of the total performance of the application. There
- are some programs that won't run without a coprocessor, among them AutoCAD
- (R10 and later) and Mathematica.
-
- Most GUIs (graphical user interfaces) such as Microsoft Windows or the OS/2
- Presentation Manager do *not* gain additional speed from using a
- *mathematical* coprocessor, since their graphics operations only use integer
- arithmetic [71]. They *will* benefit from a graphics board with a graphics
- "coprocessor" that speeds up certain common graphics operations such as
- BitBlt or line drawing. A few GUIs used on PCs, such as X-Windows, use a
- certain amount of floating-point arithmetic for operations such as arc
- drawing. However, the use of floating-point operations in X-Windows seems to
- have decreased significantly in versions after X11R3, so the overall
- performance impact of a coprocessor is small [72]. Applications running under
- any GUI may take advantage of a math coprocessor, of course (for example,
- Microsoft Excel running under Windows).
-
- While support for 80x87 coprocessors is very common in application programs,
- the Weitek Abacus coprocessors do not enjoy such widespread support. Due to
- their higher price, only a few high-end PCs have been equipped with Weitek
- coprocessors. Some machines, such as IBM's PS/2 series, do not even have
- sockets to accommodate them. Therefore, most of the programs that support
- these coprocessors are also high-end products, like AutoCAD and Versacad-386.
-
-
-
- ==============================================
- Potential performance gains with a coprocessor
- ==============================================
-
- The Intel Math Coprocessor Utilities Disk that accompanies the Intel 387DX
- coprocessor has a demonstration program that shows the speedup of certain
- application programs when run with the Intel coprocessor versus a system with
- no coprocessor:
-
- Application           Time w/o 387     Time w/387     Speedup
-
- Arts&Letters             87.0 sec        34.8 sec       150%
- Quattro Pro               8.0 sec         4.0 sec       100%
- Wingz                    17.9 sec         9.1 sec        97%
- Mathematica             420.2 sec       337.0 sec        25%
-
-
- The following table is an excerpt from [70]:
-
- Application           Time w/o 387     Time w/387     Speedup
-
- Corel Draw              471.0 sec       416.0 sec        13%
- Freedom Of Press        163.0 sec        77.0 sec       112%
- Lotus 1-2-3             257.0 sec        43.0 sec       498%
-
-
- The following table is an excerpt from [25]:
-
- Application           Time w/o 387     Time w/387     Speedup
-
- Design CAD, Test 1       98.1 sec        50.0 sec        96%
- Design CAD, Test 2       75.3 sec        35.0 sec       115%
- Excel, Test 1             9.2 sec         6.8 sec        35%
- Excel, Test 2            12.6 sec         9.3 sec        35%
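-
- The "Speedup" column in these tables appears to express how much faster the
- run with the coprocessor is, i.e. (time without / time with - 1) * 100. The
- short sketch below (Python, for illustration only; the function name
- speedup_percent is hypothetical) recomputes the column for three rows of
- the Intel demonstration program table.
-
```python
def speedup_percent(time_without, time_with):
    """Speedup in percent, rounded to the nearest integer."""
    return round((time_without / time_with - 1) * 100)

# (application, seconds without 387, seconds with 387)
rows = [
    ("Quattro Pro", 8.0, 4.0),
    ("Wingz", 17.9, 9.1),
    ("Mathematica", 420.2, 337.0),
]
for name, without, with_387 in rows:
    print(name, speedup_percent(without, with_387))
```
-
- Note that with this definition a speedup of 100% means that the program
- runs twice as fast, not that it takes no time at all.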
-
-
- Note that coprocessor performance also depends on the motherboard, or more
- specifically, the chipset used on the motherboard. In [34] and [35]
- identically configured motherboards using different 386 chipsets were tested.
- Among other tests, a coprocessor benchmark based on a fractal computation was
- run and its execution time recorded. The following tables, which show how
- coprocessor performance varies with the chipset, have been copied from these
- articles in abridged form:
-
-                      Cyrix                                  Cyrix
- chip set             387+               chip set            83D87
-
- Opti,  40 MHz    24.57 sec   97.0%     PC-Chips, 33 MHz  26.97 sec   93.0%
- Elite, 40 MHz    24.46 sec   97.4%     UMC,      33 MHz  27.69 sec   90.5%
- ACT,   40 MHz    23.84 sec  100.0%     Headland, 33 MHz  25.08 sec  100.0%
- Forex, 40 MHz    23.84 sec  100.0%     Eteq,     33 MHz  27.38 sec   91.6%
-
-
- This shows that performance of the same coprocessor can vary by up to ~10%
- depending on the chipset used on your board, at least for 386 motherboards
- (similar numbers for 286, 386SX, and 486 are, unfortunately, not available).
- The benchmarks for this article were run on a motherboard with the Forex chip
- set, one of the fastest 386 chip sets available, and not only with respect to
- floating-point performance [35].
-
-
-
- ==================================
- How various math coprocessors work
- ==================================
-
- In any 80x86 system with an 80x87 math coprocessor, CPU instructions and
- coprocessor instructions are executed concurrently. This means that the CPU
- can execute CPU instructions while the coprocessor executes a coprocessor
- instruction at the same time. The concurrency is restricted somewhat by the
- fact that the CPU has to aid the coprocessor in certain operations. As the
- CPU and the coprocessor are fed from the same instruction stream and both
- may operate on the same data, there has to be a synchronizing mechanism
- between the CPU and the coprocessor.
-
-
- The 8087
- --------
- In 8086/8088 systems with 8087 coprocessors, both chips look at every opcode
- coming in from the bus. To do this, both chips have the same BIU (bus
- interface unit) and the 8086 BIU sends the status signals of its prefetch
- queue to the 8087 BIU. This ensures that both processors always decode the
- same instructions in parallel. Since all coprocessor instructions start with
- the bit pattern 11011, it is easy for the 8087 to ignore all other
- instructions. Likewise the CPU ignores all coprocessor instructions, unless
- they access memory. In this case, the CPU computes the address of the LSB
- (least significant byte) of the memory operand and does a dummy read. The
- 8087 then takes the data from the data bus. If more than one memory access is
- needed to load a memory operand, the 8087 requests the bus from the CPU,
- generates the consecutive addresses of the operand's bytes and fetches them
- from the data bus. After completing the operation, the 8087 hands bus control
- back to the CPU. Since 8087 and CPU are hooked up to the same synchronous
- bus, they must run at the same speed. This means that with the 8087, only
- synchronous operation of CPU and coprocessor is possible.
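-
- Recognizing a coprocessor instruction therefore only requires a look at the
- upper five bits of the first opcode byte. As a small illustration (Python,
- not related to how the hardware decoder is actually built; the function name
- is hypothetical), the following sketch lists all first bytes that carry the
- 11011 pattern.
-
```python
def is_coprocessor_opcode(first_byte):
    """True if the upper five bits of the opcode byte are 11011."""
    return (first_byte >> 3) == 0b11011

escape_bytes = [b for b in range(256) if is_coprocessor_opcode(b)]
print([hex(b) for b in escape_bytes])
```
-
- Exactly eight byte values qualify, namely D8h through DFh; the remaining
- three bits of the first byte and the following byte(s) select the actual
- coprocessor operation.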
-
- Another 8087 coprocessor instruction can only be started if the previous one
- has been completed in the NEU (numerical execution unit) of the 8087. To
- prevent the 8086 from decoding a new coprocessor instruction while the 8087
- is still executing the previous coprocessor instruction, a coding mechanism
- is employed: All 8087-capable compilers and assemblers automatically
- generate a WAIT instruction before each coprocessor instruction. The WAIT
- instruction tests the CPU's /TEST pin and suspends execution until its input
- becomes "LOW". In all 8086/8087 systems, the 8086 /TEST pin is connected to
- the 8087 BUSY pin. As long as the NEU executes a coprocessor instruction, it
- forces its BUSY pin "HIGH"; thus, the WAIT opcode preceding the coprocessor
- instruction stops the CPU until any still-executing coprocessor instruction
- has finished.
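-
- The effect of this WAIT/BUSY handshake on execution time can be illustrated
- with a toy model (Python, for illustration only). The function name run, the
- program format and all cycle counts are made up for this sketch; it also
- ignores the time the CPU itself needs to decode a coprocessor instruction
- and to compute operand addresses. CPU instructions advance the CPU's clock,
- while a coprocessor instruction first waits for BUSY to go inactive and
- then executes in the background.
-
```python
def run(program):
    """Return total clock cycles for a list of ("cpu" | "fpu", cycles) items."""
    now = 0            # current time as seen by the CPU
    fpu_free_at = 0    # time at which the coprocessor's BUSY goes inactive
    for kind, cycles in program:
        if kind == "fpu":
            # The WAIT in front of the coprocessor instruction stalls the CPU
            # until the previous coprocessor instruction has finished.
            now = max(now, fpu_free_at)
            fpu_free_at = now + cycles   # executes concurrently from here on
        else:
            now += cycles                # ordinary CPU instruction
    return max(now, fpu_free_at)

# 30 cycles of integer work are hidden behind the first coprocessor
# instruction; the second one has to wait for the first to complete.
print(run([("fpu", 100), ("cpu", 30), ("fpu", 100)]))
```
-
- In this model the three instructions take 200 cycles instead of the 230
- cycles a strictly sequential execution would need.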
-
- The same synchronization is used before the CPU accesses data that was
- written by the coprocessor. A WAIT instruction after any coprocessor
- instruction that writes to memory causes the CPU to stop until the
- coprocessor has completed transfer of the data to memory, after which the CPU
- can safely access it.
-
-
- The 80287
- ---------
- The 80287 coprocessor-CPU interface is totally different from the 8087
- design. Since the 80286 implements memory protection via an MMU based on
- segmentation, it would have been much too expensive to duplicate the whole
- memory protection logic on the coprocessor, which an interface solution
- similar to the 8087 would have required. Instead, in an 80286/80287 system,
- the CPU fetches and stores all opcodes and operands for the coprocessor.
- Information is then passed through the CPU ports F8h-FFh. (As these ports are
- accessible under program control, care must be taken in user programs not to
- accidentally perform write operations to them, as this could corrupt data in
- the math coprocessor.)
-
- The 8086/8087 combination can be characterized as a cooperation of partners
- with equal rights, while the 80286/287 is more a master-slave relationship.
- This makes synchronization easier, since the complete instruction and data
- flow of the coprocessor goes through the CPU. Before executing most
- coprocessor instructions, the 80286 tests its /BUSY pin, which is tied to the
- 287 coprocessor and signals if the 80287 is still executing a previous
- coprocessor instruction or has encountered an exception. The 80286 then waits
- until the /BUSY signal goes to "low" before loading the next coprocessor
- instruction into the 80287. Therefore, a WAIT instruction before every
- coprocessor instruction is not required. These WAITs are permissible, but not
- necessary, in 80287 programs. The second form of WAIT synchronization (after
- the coprocessor has written a memory operand) *is* still necessary on 286/287
- systems.
-
- The execution unit of the 80287 is practically identical to that of the 8087;
- that is, nearly all coprocessor instructions execute in the same number of
- clock cycles on both coprocessors. However, due to the additional overhead of
- the 80287's CPU/coprocessor interface (at least ~40 clock cycles), an 8 MHz
- 80286/80287 combination can have lower floating-point performance than an
- 8086/8087 system running at the same speed. Additionally, older 286 boards
- were often configured to run the coprocessor at only 2/3 the speed of the
- CPU, making use of the ability of the 80287 to run asynchronously: The 80287
- has a CKM pin that causes the incoming system clock to be divided by three
- for the coprocessor if it is tied to ground. The 80286 always divides the
- system clock by two internally, hence the final ratio of 2/3. However, when
- the CKM (ClocK Mode) pin is tied high on the 80287, it does not divide the
- CLK input. This feature has been exploited by the makers of coprocessor speed
- sockets. These sockets tie CKM high and supply their own CLK signal with a
- built-in oscillator, thereby allowing the 80287 or compatible to run at a
- much higher speed than the CPU. With an IIT or Cyrix 287 one can have a 20
- MHz coprocessor running with an 8 MHz 80286! Note, however, that the
- floating-point performance of such a configuration does not scale linearly
- with the coprocessor clock, since all the data has to be passed through the
- much slower CPU. If the coprocessor executes mostly simple instructions (such
- as addition and multiplication), doubling the coprocessor clock to 20 MHz in
- a 10 MHz system does not show any performance increase at all [24].
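-
- The clock relationships described above amount to simple divisions. The
- sketch below (Python, for illustration only; the function names and the
- 24 MHz example clock are hypothetical) computes the effective CPU and
- coprocessor clocks for an original 80287 with CKM tied to ground and for a
- speed socket that ties CKM high and feeds the coprocessor from its own
- oscillator.
-
```python
from fractions import Fraction

def cpu_clock(system_clk_mhz):
    """The 80286 always divides the system clock by two."""
    return Fraction(system_clk_mhz) / 2

def coprocessor_clock(clk_input_mhz, ckm_high):
    """Original 80287: CLK/3 with CKM grounded, CLK undivided with CKM high."""
    clk = Fraction(clk_input_mhz)
    return clk if ckm_high else clk / 3

# Board with a 24 MHz system clock, CKM tied to ground
cpu = cpu_clock(24)                          # 12 MHz
fpu = coprocessor_clock(24, ckm_high=False)  # 8 MHz
print(cpu, fpu, fpu / cpu)                   # ratio is 2/3

# Speed socket: CKM high, 20 MHz oscillator, CPU still fed from 16 MHz
print(cpu_clock(16), coprocessor_clock(20, ckm_high=True))
```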
-
- The Intel 80287XL, the Cyrix 82S87, and the IIT 2C87 contain the internals of
- a 387 coprocessor, but are pin-compatible with the original 287. These chips
- divide the system clock by two internally, as opposed to three in the
- original 80287. Since the 80286 also divides the system clock by two, they
- usually run synchronously with respect to the CPU, although they can also be
- run asynchronously.
-
-
- The 80387
- ---------
- The coprocessor interface in 80386/80387 systems is very similar to the one
- found in 286/287 systems. However, to prevent corruption of the coprocessor's
- contents by programming errors, the IO ports 800000F8h-800000FFh are used,
- which are not accessible to programs. The CPU/coprocessor interface has been
- optimized and uses full 32-bit transfers; the interface overhead has been
- reduced to about 14-20 clock cycles. For some operations on the 387 'clones'
- that take less than about 16 clock cycles to complete, this overhead
- effectively limits the execution rate of coprocessor instructions. The only
- sensible solution to provide even higher floating-point performance was to
- integrate the CPU and coprocessor functionality onto the same chip, which
- is exactly what Intel did with the 80486 CPU. The FPU in the 486 also benefits
- from the instruction pipelining and from the on-chip cache.
-
-
-
- =====================
- Coprocessor emulators
- =====================
-
- In the absence of a coprocessor, floating-point calculations are often
- performed by a software package that simulates its operations. Such a program
- is called a coprocessor emulator. Simulating the coprocessor has the
- advantage for application programs that identical code can be generated for
- use with either the coprocessor or the emulator, so that it's possible to
- write programs that run on any system without regard to whether a coprocessor
- is present or not. Whether the program will use an actual coprocessor or
- software emulating it can easily be determined at run-time by detecting the
- presence or absence of the coprocessor chip.
-
- Two approaches to interface an 80x87 emulator to programs are common. The
- first method makes use of the fact that all coprocessor instructions start
- with the same five-bit pattern 11011. Thus the first byte of a coprocessor
- instruction will be in the range D8-DF hexadecimal. In addition, coprocessor
- instructions usually are preceded by a WAIT instruction (opcode 9Bh) which is
- one byte long (the reason for doing this has been described in the previous
- chapter dealing with the operating details of the 80x87). One common approach
- is to replace the WAIT instruction and the first byte of the coprocessor
- instruction with one out of eight interrupt instructions; the remaining bytes
- of the coprocessor instruction are left unchanged. Interrupts 34 to 3B
- hexadecimal are used for this emulation technique. (Note that the sequences
- 9B D8 ... 9B DF can be easily converted to the interrupt instructions CD 34
- ... CD 3B by simple addition and subtraction of constants.) The compiler or
- assembler initially produces code that contains the appropriate interrupt
- calls instead of the coprocessor instructions. If a hardware coprocessor is
- detected at run-time, the emulator interrupts point to a short routine that
- converts the interrupt calls back to coprocessor instructions (yes, this
- is known as "self-modifying code"). If no coprocessor is found the interrupts
- point to the emulation package, which examines the byte(s) following the
- interrupt instruction to determine which floating-point operation to perform.
- This method is used by many compilers, including those from Microsoft and
- Borland. It works with every 80x86 CPU from the 8086/8088 on.
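-
- The conversion between the two byte sequences can be sketched as follows
- (Python, for illustration only; the function names are hypothetical and the
- code operates on a byte string rather than patching a running program, as a
- real fix-up routine would). Only the two-byte pattern described above is
- handled; any other input is returned unchanged.
-
```python
WAIT_OPCODE = 0x9B   # WAIT instruction
INT_OPCODE = 0xCD    # INT nn instruction

def to_emulator_call(code):
    """Replace WAIT + first coprocessor byte (9B D8..DF) by INT 34h..3Bh."""
    if len(code) >= 2 and code[0] == WAIT_OPCODE and 0xD8 <= code[1] <= 0xDF:
        return bytes([INT_OPCODE, code[1] - 0xD8 + 0x34]) + code[2:]
    return code

def to_coprocessor_instruction(code):
    """Inverse fix-up: turn INT 34h..3Bh back into WAIT + D8..DF."""
    if len(code) >= 2 and code[0] == INT_OPCODE and 0x34 <= code[1] <= 0x3B:
        return bytes([WAIT_OPCODE, code[1] - 0x34 + 0xD8]) + code[2:]
    return code

# WAIT followed by FADD ST, ST(0), which is encoded as D8 C0
original = bytes([0x9B, 0xD8, 0xC0])
patched = to_emulator_call(original)
print(patched.hex(" "), "->", to_coprocessor_instruction(patched).hex(" "))
```
-
- Because both sequences are two bytes long and the remaining bytes are left
- untouched, the conversion can be done in place in either direction.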
-
- The second method to interface an emulator is only available on 286/386/486
- machines. If the emulation bit in the machine status word of these processors
- is set, the processors will generate an interrupt 7 whenever a coprocessor
- instruction is encountered. The vector for this interrupt will have been set
- up to point at an emulation package that decodes the instruction and performs
- the desired operation. This approach has the advantage that the emulator
- doesn't have to be included in the program code, but can be loaded once (as a
- TSR or device driver) and then used by every program that requires a
- coprocessor. Emulation via interrupt 7 is transparent, which means that
- programs containing coprocessor instructions execute just as if a coprocessor
- were present, only slower. This approach is taken by the public domain EM87
- emulator, the shareware program Q387, and the commercial Franke387 emulator,
- for example. With emulators using INT 7, even programs that require a
- coprocessor to run, such as AutoCAD, are 'fooled' into believing that a
- coprocessor is present.
-
- Operating systems such as OS/2 2.0 and Windows 3.1 provide coprocessor
- emulations using INT 7 automatically if they do not find a coprocessor to be
- installed. The emulator in Windows doesn't seem to be very fast, as people
- who have ported their Turbo Pascal programs from the TP 6.0 DOS compiler
- (using the emulation built into the TP 6.0 run-time library) to the TPW 1.5
- Windows compiler (using MS Windows' emulator) have noticed. Slowdowns of as
- much as a factor of five have been reported [79]. Also, some instructions
- are not supported by the Windows 3.1 emulator, e.g. the FBSTP instruction.
-
- The size of the emulator used by TP 6.0 is about 9.5 KB, while EM87 occupies
- about 15.8 KB as a TSR, and Franke387 uses about 13.4 KB as a device driver.
- Note that Franke387 and especially EM87 model a real coprocessor much more
- closely than Turbo Pascal's emulator does. In particular, EM87 supports
- denormal numbers, precision control, and rounding control. The emulator in TP
- 6.0 does not implement these features. The version of Franke387 tested (V2.4)
- supports denormals in single and double-precision, but not double extended
- precision, and it supports precision control, but not rounding control.
-
- The shareware program Q387, introduced in 1992, only runs on 386, 386SX, 486SX
- and compatible processors. The program loads completely into extended memory
- and uses about 360 KB. To enable INT 7 trapping to a service routine in
- extended memory it needs to run with a memory manager (e.g. EMM386, QEMM,
- or 386MAX). The huge size of the program stems from the fact that it was
- solely optimized for speed, assuming that extended memory is a cheap resource.
- Presumably it uses large tables to speed computations. Q387 seems to be
- the only emulator that is still being maintained and updated at this time.
-
- Intel's E80287 program is supposed to be a 100% exact emulation of the
- 80287 coprocessor [44]. Note that the more closely a real coprocessor is
- modelled by the emulator, the slower the emulator runs and the larger the
- code for the emulator gets.
-
-
- Relative execution times of coprocessor vs. software emulators
- for selected coprocessor instructions
-
-                   Intel 387DX   TP 6.0 Emulator   EM87 Emulator
-
- FADD ST, ST(0)         1               26              104
- FDIV [DWord]           1               22              136
- FXAM                   1               10               73
- FYL2X                  1               33              102
- FPATAN                 1               36              110
- F2XM1                  1               38              110
-
-
-
- The following table is an excerpt from [44]:
-
-                   Intel 80287   Intel E80287 Emulator
-
- FADD ST, ST(0)         1                  42
- FDIV [DWord]           1                 266
- FXAM                   1                 139
- FYL2X                  1                  99
- FPATAN                 1                 153
- F2XM1                  1                  41
-
-
-
- The following has been adapted from [43] and merged with my own
- data:
-
-                   Intel 8087   TP 6.0 Emul. (8086)   Intel Emul. (8086)
-
- FADD ST, ST(0)        1                 20                    94
- FDIV [DWord]          1                 22                    82
- FPTAN                 1                 18                   144
- F2XM1                 1                  6                   171
- FSQRT                 1                 44                   544
-
-
-
- One of the reasons emulators are so slow is that they are often designed to
- run with every CPU from the 8086/8088 on upwards. This is the case with the
- emulators built into the compiler libraries of the Turbo Pascal 6.0 (also
- used by Turbo C/C++) and Microsoft C 6.0 compilers (probably also used in
- other Microsoft products) and is also true for the EM87 emulator in the
- public domain. By using code that can run on an 8086/8088, these emulators
- forego the speed advantage offered by the additional instructions and
- architectural enhancements (such as 32-bit registers) of the more advanced
- Intel 80x86 processors. A notable exception to this is the Franke387
- emulator, a commercial emulator that is also sold as shareware. It uses 386-
- specific 32-bit code and only runs on 386/386SX/486SX computers.
-
- Besides being slow, coprocessor emulators have other drawbacks when compared
- with real coprocessors. Most of the emulators do not support the additional
- instructions that the 387-compatible coprocessors offer over the 80287.
- Often, some of the low-level stack-manipulating instructions like FDECSTP are
- not emulated. For example, [76] lists the coprocessor instructions not
- emulated by Microsoft's emulator (included in the MS-C and MS-FORTRAN
- libraries) as follows:
-
- FCOS      FRSTOR    FSINCOS   FXTRACT
- FDECSTP   FSAVE     FUCOM
- FINCSTP   FSETPM    FUCOMP
- FPREM1    FSIN      FUCOMPP
-
- Additionally, some parts of the coprocessor architecture, like the status
- register, are often emulated only partially or not at all. Some emulators do
- not conform to the IEEE-754 standard in their implementation of the basic
- arithmetic functions, while the hardware coprocessors do. Also, they
- sometimes lack support for denormals (a special class of floating-point
- numbers), although it is required by the standard. Not all the 80x87 emulators
- support rounding control and precision control, also features required by
- IEEE-754. Most of these omissions are aimed at making the emulator faster and
- smaller. Because of the performance gap and these other shortcomings of
- coprocessor emulators, a real coprocessor is a must for anybody planning to
- do some serious computations. (At today's prices, this shouldn't pose much of
- a problem to anybody!)
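-
- To make the term concrete: a denormal is a nonzero value that is smaller
- in magnitude than the smallest normalized number of its format. The short
- Python sketch below is only an illustration and is not taken from any of
- the emulators discussed. It assumes that the host's Python floats are
- IEEE-754 doubles with gradual underflow (the usual case); the helper name
- is_denormal is made up for this example.
-
```python
import sys

# Smallest positive *normalized* IEEE-754 double (about 2.2e-308).
smallest_normal = sys.float_info.min

# Halving it leaves the normalized range. With denormal support the
# result is a nonzero denormal instead of zero.
half = smallest_normal / 2.0

# Smallest positive denormal double, 2**-1074.
smallest_denormal = 5e-324


def is_denormal(x: float) -> bool:
    """True if x is nonzero but smaller in magnitude than any normal double."""
    return x != 0.0 and abs(x) < sys.float_info.min


print(is_denormal(half))               # True
print(smallest_denormal / 2.0 == 0.0)  # True: nothing left below this
```
-
- An implementation without denormal support would be expected to deliver
- zero for the first division as well, losing the value entirely.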
-
- Nhuan Doduc (ndoduc@framentec.fr) has tested a number of standalone
- coprocessor emulators for PCs, among them the two emulators, EM87 and
- Franke387 V2.4, already mentioned. He found Franke387 to be the best in terms
- of reliability, speed, and accuracy.
-
-
-
- =============================
- Installing a math coprocessor
- =============================
-
- Usually, installing a coprocessor doesn't pose much of a problem, as every
- coprocessor comes with installation instructions and a diagnostic disk that
- lets you check its correct operation after installation. In addition, the
- user manuals of most computers have a section on coprocessor installation.
-
- 1) Make sure to buy the right coprocessor for your system. An 8087 works
- together with 8086, 8088, V20, and V30 CPUs. An 80287, 287XL or
- compatible works with an 80286 CPU. (There are also some old 386
- motherboards that accept an 80287 coprocessor, but they usually also
- provide a socket for the 387; given today's pricing, it makes no sense
- not to get a 387 for these systems.) An 80387, 387DX or compatible
- coprocessor is for 386-based systems, as is the Intel RapidCAD. 387
- coprocessors also work with the Cyrix 486DLC CPU (which, despite its
- name, does not include an FPU). Similarly, the 387SX or compatible
- coprocessors go into systems whose CPU is a 386SX or Cyrix 486SLC.
-
- The Weitek Abacus 3167 works with a 386 CPU but requires a 121-pin EMC
- socket in the system; this is *not* the same socket used by an 80387 or
- compatible chip, and some computers, such as IBM's PS/2s, don't have
- this socket. The Weitek Abacus 4167 works together with the 486 and
- requires a special 142-pin socket to be present.
-
- 2) Always install a coprocessor that's rated at the same clock speed as the
- CPU. For example, in a 40 MHz 386 system using an AMD Am386-40, install
- a coprocessor rated for 40 MHz such as a Cyrix 83D87-40, C&T 38700DX-40,
- IIT 3C87-40, ULSI 83C87-40, or ULSI DX/DLC-40. Running a coprocessor
- above its specified frequency rating may cause it to produce false
- results, which you might fail to recognize as such. (I have personally
- experienced this problem with a Cyrix 83D87-33 that I tried to push to
- 40 MHz. It passed all the diagnostic benchmarks on the Cyrix diagnostic
- disk and the tests of some commercial system test programs. However, I
- found it to fail the Whetstone and Linpack benchmarks, which include
- accuracy checks.) Although there is usually no problem with overheating
- when pushing a coprocessor over the specified maximum frequency rating,
- be warned that operation of a coprocessor above the maximum ratings
- stated by the manufacturer may make its operation unreliable.
-
- Some 386 boards allow the coprocessor to be clocked differently than the
- CPU. This is called "asynchronous operation" and allows you, for
- example, to run the coprocessor at 33 MHz while the CPU runs at 40 MHz.
- Of the currently available math coprocessors, only the Intel 80387 and
- 387DX support asynchronous operation. The 387-compatible "clones" from
- Cyrix, C&T, IIT, and ULSI always run at the full speed of the CPU, even
- if you have set up your motherboard for asynchronous operation.
-
- 3) Once you've got the correct coprocessor for your system you can start
- the actual installation process. Turn off the computer's power switch
- and unplug the power cord from the wall outlet, remove the case, and
- locate the math coprocessor socket. This socket is always located right
- next to the main CPU, which can be identified by the printing on top of
- the chip. (It's also usually one of the biggest chips on the board.) The
- 8087 and 80287 DIL sockets are rectangular sockets with 20 pin holes on
- each of the longer sides. The 387SX PLCC socket is a square socket that
- has 17 vertical connector strips on the 'wall' of each side. The 387 PGA
- socket is square and has two rows of pin holes on each side. The EMC
- socket for the Weitek 3167 is similar but has three rows of holes on
- each side. The PGA socket for the Weitek 4167 is also square with three
- rows of holes on each side. If you can't find the math coprocessor
- socket, consult your owner's manual, your computer dealer, or a
- knowledgeable friend.
-
- If you are installing the Intel RapidCAD chipset in a 386 system, you
- will have to remove the 386 CPU first. Intel provides an easy-to-use
- chip extractor and a storage box for the 386 chip for this purpose. Just
- follow the instructions in the RapidCAD installation manual.
-
- On many systems, the motherboard is supported only at a small number of
- points. Since considerable force is required to insert a pin grid chip
- like the 80387, RapidCAD, or Weitek Abacus 3167 into its socket, the
- board may bend quite a lot due to the insertion pressure. This could
- cause cracks in the board's conductive traces that may render it
- intermittently or completely inoperable. Damage done to the board in
- this way is usually not covered by the computer's warranty! Therefore,
- it may be a good idea to first check how much the board bends by
- pressing on the math coprocessor socket with your finger. If you find it
- to bend easily, try to put something under the board directly beneath
- the coprocessor socket. If this is impossible, as it is in many desktop
- cases, consider removing the whole motherboard from the case, and
- placing it on a hard, flat surface free of static electricity. (You will
- also have to do this if your system's CPU and coprocessor socket are on
- a separate card rather than on the motherboard, as is typical in many
- modular systems.)
-
- Be sure you are properly grounded before you remove the coprocessor from
- its antistatic box, as even a tiny jolt of static electricity can ruin
- the coprocessor. Make sure you do not touch the pins on the bottom of
- the chip.
-
- Check the pins and make sure none are bent; if some are, you can
- *carefully* straighten them with needle-nose pliers or tweezers.
-
- 4) Match the coprocessor's orientation with the orientation of the socket.
- Correct orientation of the coprocessor is absolutely essential, because
- if you insert it the wrong way it may be damaged.
-
- 8087 and 287 coprocessors have a notch on one of the shorter sides of their
- rectangular DIL package that should be matched with the notch of the
- coprocessor socket. Usually the 286 CPU and the 287 coprocessor are
- placed alongside each other and both have the same orientation (that
- is, their respective notches point in the same direction). 387SX
- coprocessors feature a white dot or similar mark that matches with some
- sort of marking on the socket. 387 coprocessors have a bevelled corner
- that is also marked with a white dot or similar marking. This should be
- matched with the bevelled or otherwise marked corner of the socket. If
- your system has only a large EMC socket and you are installing a 387 in
- it, you will leave one row of pin holes free on each side of the chip.
-
- Once you have found the correct orientation, place the chip over the
- socket and make sure all pins are correctly aligned with their
- respective holes. Press firmly and evenly on the chip -- you may have to
- press hard to seat the coprocessor all the way. Again, make sure your
- motherboard does not bend more than slightly under the insertion
- pressure. For 8087, 287, and 387 coprocessors it is normal that the
- coprocessor does not go all the way in; about one millimeter (1/25 inch)
- of space is usually left between the socket and the bottom of the
- coprocessor chip. (This allows the insertion of an extraction device
- should it become necessary to remove the chip. Note that the
- construction of the 387SX's PLCC socket makes it next-to-impossible to
- remove the coprocessor once fully inserted, as the top of the chip is
- level with the socket's 'walls'.)
-
- 5) Check your computer's manual for the proper position of any jumpers or
- switches that need to be set to tell the system it now has a coprocessor
- (and possibly, which kind it has). Put the cover back on the system
- unit, reconnect the power, and turn on your computer. Depending on your
- system's BIOS, you may now have to run a setup or configuration program
- to enable the coprocessor. Finally, run the programs supplied on the
- diagnostic disk (included with your coprocessor) to check for its
- correct operation.
-
-
-
- =================================================================
- Descriptions of available coprocessors, CPU+FPU (as of 01-11-93):
- =================================================================
-
- Intel 8087
-
- [43] This was the first coprocessor that Intel made available for the
- 80x86 family. It was introduced in 1980 and therefore does not have full
- compatibility with the IEEE-754 standard for floating-point arithmetic
- (which was finally released in 1985). It complements the 8088 and 8086
- CPUs and can also be interfaced to the 80188 and 80186 processors.
-
- The 8087 is implemented using NMOS. It comes in a 40-pin CERDIP (ceramic
- dual inline package). It is available in 5 MHz, 8 MHz (8087-2), and 10
- MHz (8087-1) versions. Power consumption is rated at max. 2400 mW [42].
-
- A neat trick to enhance the processing power of the 8087 for
- computations that use only the basic arithmetic operations (+,-,*,/) and
- do not require high precision is to set the precision control to single-
- precision. This gives one a performance increase of up to 20%. For
- details about programming the precision control, see program PCtrl in
- appendix A.
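-
- As a rough illustration of what setting the precision control involves
- (this is not the PCtrl program from appendix A), the precision control
- field occupies bits 8 and 9 of the 80x87 control word. A real program
- would read the control word with FSTCW, change those two bits, and write
- it back with FLDCW. The Python sketch below only models the bit
- manipulation; the function and constant names are invented for this
- example, and the starting value 0x037F is merely a sample control word.
-
```python
# Layout of the precision-control (PC) field in the 80x87 control word.
PC_SHIFT = 8
PC_MASK = 0b11 << PC_SHIFT   # bits 8 and 9

PC_SINGLE = 0b00     # 24-bit mantissa
PC_DOUBLE = 0b10     # 53-bit mantissa
PC_EXTENDED = 0b11   # 64-bit mantissa (0b01 is reserved)


def with_precision(control_word: int, pc: int) -> int:
    """Return control_word with its PC field replaced by pc."""
    if pc not in (PC_SINGLE, PC_DOUBLE, PC_EXTENDED):
        raise ValueError("reserved or invalid precision-control encoding")
    # Clear bits 8-9, keep the word 16 bits wide, then insert the new field.
    return (control_word & ~PC_MASK & 0xFFFF) | (pc << PC_SHIFT)


# Sample control word with extended precision selected, switched to single.
print(hex(with_precision(0x037F, PC_SINGLE)))   # 0x7f
```
-
- All other bits (exception masks, rounding control) are left untouched,
- which is why the word is modified rather than overwritten.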
-
- With the help of an additional chip, the 8087 can in theory be
- interfaced to an 80186 CPU [36]. The 80186 was used in some PCs (e.g.
- from Philips, Siemens) in the 1982/1983 time frame, but with IBM's
- introduction of the 80286-based AT in 1984, it soon lost all
- significance for the PC market.
-
-
- Intel 80187
-
- The 80187 is a rather new coprocessor designed to support the 80C186
- embedded controller (a CMOS version of the 80186 CPU; see above). It was
- introduced in 1989 and implements the complete 80387 instruction set. It
- is available in a 40 pin CERDIP (ceramic dual inline package) and a 44
- pin PLCC (plastic leaded chip carrier) for 12.5 and 16 MHz operation.
- Power consumption is rated at max. 675 mW for the 12.5 MHz version and
- max. 780 mW for the 16 MHz version [37].
-
-
- Intel 80287
-
- [44] This is the original Intel coprocessor for the 80286, introduced in
- 1983. It uses the same internal execution unit as the 8087 and therefore
- has the same speed (actually, it is sometimes slower due to additional
- overhead in CPU-coprocessor communication). As with the 8087, it does
- not provide full compatibility with the IEEE-754 floating point standard
- released in 1985.
-
- The 80287 was manufactured in NMOS technology, and is packaged in a 40-
- pin CERDIP (ceramic dual inline package). There are 6 MHz, 8 MHz, and 10
- MHz versions. Power consumption can be estimated to be the same as that
- for the 8087, which is 2400 mW max.
-
- The 80287 has been replaced in the Intel 80x87 family with its faster
- successor, the CMOS-based Intel 287XL, which was introduced in 1990 (see
- below). There may still be a few of the old 80287 chips on the market,
- however.
-
-
- Intel 80287XL
-
- This chip is Intel's second-generation 287, first introduced in 1990.
- Since it is based on the 80387 coprocessor core, it features full IEEE
- 754 compatibility and faster instruction execution. Intel claims about
- 50% faster operation than the 80287 for typical benchmark tests such as
- Whetstone [45]. Comparison with benchmark results for the AMD 80C287,
- which is identical to the Intel 80287, supports this claim [1]: The Intel
- 287XL performed 66% faster than the AMD 80C287 on a fractal benchmark
- and 66% faster on the Whetstone benchmark in these tests. Whetstone
- results from [46] show the Intel 287XL at 12.5 MHz to perform 552
- kWhets/sec as opposed to the AMD 80C287's 289 kWhets/sec, a 91%
- performance increase. A benchmark using the MathPak program showed the
- Intel 287XL to be 59% faster than the Intel 80287 (6.9 sec. vs. 11.0
- sec.) [26]. Since the 287XL has all the additional instructions and
- enhancements of a 387, most software automatically identifies it as an
- 80387-compatible coprocessor and therefore can make use of extra 387-
- only features, such as the FSIN and FCOS instructions.
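-
- The percentages quoted above come from two kinds of figures: throughput
- (kWhets/sec, higher is better) and run time (seconds, lower is better).
- The following Python sketch shows how both kinds translate into a
- "percent faster" value, using the numbers from the text. The function
- names are made up for this illustration.
-
```python
def faster_by_rate(new_rate: float, old_rate: float) -> float:
    """Percent speed-up from two throughput figures (higher is better)."""
    return (new_rate / old_rate - 1.0) * 100.0


def faster_by_time(new_time: float, old_time: float) -> float:
    """Percent speed-up from two run times (lower is better)."""
    return (old_time / new_time - 1.0) * 100.0


# Whetstone, 287XL vs. AMD 80C287 (kWhets/sec)
print(round(faster_by_rate(552, 289)))    # 91
# MathPak, 287XL vs. 80287 (seconds)
print(round(faster_by_time(6.9, 11.0)))   # 59
```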
-
- The 287XL is manufactured in CMOS and therefore uses much less power
- than the older NMOS-based 80287. At 12.5 MHz, the power consumption is
- rated at max. 675 mW, about 1/4 of the 80287 power consumption. The
- 287XL is available in either a 40-pin CERDIP (ceramic dual inline
- package) or a 44 pin PLCC (plastic leaded chip carrier). (This latter
- version is called the 287XLT and intended mainly for laptop use.) The
- 287XL is rated for speeds of up to 12.5 MHz.
-
-
- AMD 80C287
-
- This chip, manufactured by Advanced Micro Devices (AMD), is an exact
- clone of the old Intel 80287, and was first brought to market by AMD in
- 1989. It contains the original microcode of the 80287 and is therefore
- 100% compatible with it. However, as the name indicates, the 80C287 is
- manufactured in CMOS and therefore uses less power than an equivalent
- Intel 80287. At 12.5 MHz, its power consumption is rated at max. 625 mW
- or slightly less than that of the Intel 80287XL [27]. There is also
- another version called AMD 80EC287 that uses an 'intelligent' power save
- feature to reduce the power consumption below 80C287 levels. Tests at
- 10.7 MHz show typical power consumption for the 80EC287 to be at 30 mW,
- compared to 150 mW for the AMD 80C287, 300 mW for the Intel 287XL and
- 1500 mW for the Intel 80287 [57]. The 80EC287 is therefore ideally
- suited for low power laptop systems.
-
- The AMD 80C287 is available in speeds of 10, 12, and 16 MHz. (I have
- only seen it being offered in 10 MHz and 12 MHz versions, however.) At
- about US$ 50, it is currently the cheapest coprocessor available. Note
- that it provides less performance than the newer Intel 287XL (see
- above). The AMD 80C287 is available in 40 pin ceramic and plastic DIPs
- (dual inline package) and as 44 pin PLCC (plastic leaded chip carrier).
-
- Due to recent legal battles with Intel over the right to use the 287
- microcode, which AMD lost, AMD had to discontinue this product and it
- is no longer available.
-
-
- Cyrix 82S87
-
- This 80287-compatible chip was developed from the Cyrix 83D87 (Cyrix's
- 80387 'clone') and has been available since 1991. It complies completely
- with the IEEE-754 standard for floating-point arithmetic and features
- nearly total compatibility with Intel's coprocessors, including
- implementation of the full Intel 80387 instruction set. It implements
- the transcendental functions with the same degree of accuracy and the
- superior speed of the Cyrix 83D87. This makes the Cyrix 82S87 the
- fastest [1] and most accurate 287 compatible coprocessor available.
- Documentation by Cyrix [46] rates the 82S87 at 730 kWhets/sec for a 12.5
- MHz system, while the Intel 287XL performs only 552 kWhets/sec. 82S87
- chips manufactured after 1991 use the internals of the Cyrix 387+, which
- succeeds the original 83D87 [73].
-
- The 82S87 is a fully static CMOS design with very low power requirements
- that can run at speeds of 6 to 20 MHz. Cyrix documentation shows the
- 82S87 to consume about the same amount of power as the AMD 80C287 (see
- above). The 82S87 comes in a 40 pin DIP or a 44 pin PLCC (plastic leaded
- chip carrier) compatible with the pinout of the Intel 287XLT and
- ideally suited for laptop use.
-
-
- IIT 2C87
-
- This chip was the first 80287 clone available, introduced to the market
- in 1989. It has about the same speed as the Intel 287XL [1]. The 2C87
- implements the full 80387 instruction set [38]. Tests I ran on the 3C87
- seem to indicate that it is not fully compatible with the IEEE-754
- standard for floating-point arithmetic (see below for details), so it
- can be assumed that the 2C87 also fails these tests (as it presumably
- uses the same core as the 3C87).
-
- The IIT 2C87 provides extra functions not available on any other 287
- chip [38]. It has 24 user-accessible floating-point registers organized
- into three register banks. Additional instructions (FSBP0, FSBP1, FSBP2)
- allow switching from one bank to another. (Transfers between registers
- in different banks are not supported, however, so this feature by itself
- is of limited usefulness. Also, there seems to be only one status
- register (containing the stack top pointer), so it has to be manually
- loaded and stored when switching between banks with a different number
- of registers in use [40]). The register bank's main purpose is to aid
- the fourth additional instruction the 2C87 has (F4X4), which does a full
- multiply of a 4x4 matrix by a 4x1 vector, an operation common in 3D-
- graphics applications [39]. The built-in matrix multiply speeds this
- operation up by a factor of 6 to 8 when compared to a programmed
- solution according to the manufacturer [38]. Tests show the speed-up to
- be indeed in this range [40]. For the 3C87, I measured the execution
- time of F4X4 to be about 280 clock cycles; the execution time on the
- 2C87 should be somewhat larger - I estimate it to be around 310 clock
- cycles due to the higher CPU-NDP communication overhead in instruction
- execution in 286/287 systems (~45-50 clock cycles) compared with 386/387
- systems (~16-20 clock cycles). As desirable as the F4X4 instruction may
- seem, however, there are very few applications that make use of it when
- an IIT coprocessor is detected at run time (among them Schroff
- Development's Silver Screen and Evolution Computing's Fast-CAD 3-D
- [25]).
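-
- The estimate of about 310 clock cycles can be retraced with a few lines
- of arithmetic. The Python sketch below assumes that the only difference
- between the two chips is the CPU-NDP communication overhead and that it
- is incurred once per F4X4 instruction; this is one reading of the
- estimate above, not a measurement.
-
```python
measured_3c87 = 280        # cycles measured for F4X4 on the 3C87
overhead_386 = (16, 20)    # CPU-NDP overhead range on 386/387 systems
overhead_286 = (45, 50)    # CPU-NDP overhead range on 286/287 systems

# Replace the 386/387 overhead with the 286/287 overhead, taking the
# smallest and the largest possible difference.
low = measured_3c87 + overhead_286[0] - overhead_386[1]
high = measured_3c87 + overhead_286[1] - overhead_386[0]

print(low, high)   # 305 314
```
-
- The figure of 310 clock cycles lies in the middle of this range.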
-
- The 2C87 is available for speeds of up to 20 MHz. It is implemented in
- an advanced CMOS process and has therefore a low power consumption of
- typically about 500 mW [38].
-
-
- Intel 80387
-
- This chip was the first generation of coprocessors designed specifically
- for the Intel 80386 CPU. It was introduced in 1986, about one year after
- the 80386 was brought to market. Early 386 systems were therefore
- equipped with both an 80287 and an 80387 socket. The 80386 does work with
- an 80287, but the numerical performance is hardly adequate for such a
- system.
-
- The 80387 has itself since been superseded by the Intel 387DX, which was
- quietly introduced in 1989 (see below). You might find it when acquiring
- an older 386 machine, though. The old 80387 is about 20% slower than the
- newer 387DX.
-
- The 80387 is packaged in a 68-pin ceramic PGA, and was manufactured
- using Intel's older 1.5 micron CHMOS III technology, giving it moderate
- power requirements. Power consumption at 16 MHz is max. 1250 mW (750 mW
- typical), at 20 MHz max. 1550 mW (950 mW typical), and at 25 MHz max.
- 1950 mW (1250 mW typical) [60].
-
-
- Intel 387DX
-
- The 387DX is the second-generation Intel 387; it was quietly introduced
- to replace the original 80387 in 1989. This version is done in a more
- advanced CMOS process which enables the coprocessor to run at a maximum
- frequency of 33 MHz (the 80387 was limited to a maximum frequency of 25
- MHz). The 387DX is also about 20% faster than the 80387 on the average
- for the same clock frequency. For a 386/387 system operating at 29 MHz
- the Whetstone benchmark (compiled with the highly optimizing Metaware
- High-C V1.6) runs at 2377 kWhetstones/sec for the 80387 and at 2693
- kWhetstones/sec for the 387DX, a 13% increase. In a fractal calculation
- programmed in assembly language, the 387DX performance was 28% higher
- than the performance of the 80387. The transcendental functions have
- also sped up from the 80387 to the 387DX. In the Savage benchmark
- (again, compiled with Metaware High-C V1.6 and running on a 29 MHz
- system), the 80387 evaluated 77600 function calls/second, while the
- 387DX evaluated 97800 function calls/second, a 26% increase [7]. Some
- instructions have been sped up a lot more than the average 20%. For
- example, the performance of the FBSTP instruction has increased by a
- factor of 3.64.
-
- The Intel 387DX (and its predecessor 80387) are the only 387
- coprocessors that support asynchronous operation of CPU and coprocessor.
- The 387 consists of a bus interface unit and a numerical execution unit.
- The bus interface unit always runs at the speed of the CPU clock
- (CPUCLK2). If the CKM (ClocK Mode) pin of the 387 is strapped to Vcc,
- the numerical execution unit runs at the same speed as the bus interface
- unit. If CKM is tied to ground, the numerical execution unit runs at the
- speed provided by the NUMCLK2 input. The ratio of NUMCLK2 (coprocessor
- clock) to CPUCLK2 (CPU clock) must lie within the range 10:16 to 14:10.
- For example, for a 20 MHz 386, the Intel 387DX could be clocked from
- 12.5 MHz to 28 MHz via the NUMCLK2 input. (On the Cyrix 83D87, Cyrix
- 387+, ULSI 83C87, ULSI DX/DLC, and the IIT 387, the CKM pin is not
- connected. These coprocessors are therefore not capable of asynchronous
- operation and always run at the speed of the CPU.)
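-
- The permitted clock range follows directly from the two ratio limits.
- The Python sketch below reproduces the 20 MHz example; the function names
- are invented for this illustration, and the check covers only the ratio,
- not the maximum frequency rating of a particular coprocessor.
-
```python
from fractions import Fraction

# Allowed range of NUMCLK2 : CPUCLK2 for asynchronous operation.
RATIO_MIN = Fraction(10, 16)
RATIO_MAX = Fraction(14, 10)


def numclk2_range(cpu_mhz):
    """Lowest and highest coprocessor clock (MHz) for a given CPU clock."""
    cpu = Fraction(cpu_mhz)
    return float(cpu * RATIO_MIN), float(cpu * RATIO_MAX)


def ratio_allowed(numclk_mhz, cpu_mhz):
    """True if the coprocessor/CPU clock ratio lies within the limits."""
    ratio = Fraction(numclk_mhz) / Fraction(cpu_mhz)
    return RATIO_MIN <= ratio <= RATIO_MAX


print(numclk2_range(20))       # (12.5, 28.0)
print(ratio_allowed(33, 40))   # True
```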
-
- The Intel 387DX is manufactured using Intel's advanced low power CHMOS
- IV technology. Power consumption at 20 MHz is max. 900 mW (525 mW
- typical), at 25 MHz max. 1050 mW (625 mW typical), and at 33 MHz max.
- 1250 mW (750 mW typical) [59].
-
-
- Intel 387SX
-
- This is the coprocessor paired with the Intel 386SX CPU. The 386SX is an
- Intel 80386 with a 16-bit, rather than 32-bit, data path. This reduces
- (somewhat) the costs to build a 386SX system as compared to a full 32-
- bit design required by a 386DX. (The 386SX's main *marketing* purpose
- was to replace the 80286 CPU, which was being sold more cheaply by other
- manufacturers [such as AMD], and which Intel subsequently stopped
- producing.) Due to the 16-bit data path, the 386SX is slower than the
- 386DX and offers about the same speed as an 80286 at the same clock
- frequency for 16-bit applications. But as the 386SX is a complete 80386
- internally, it also offers the possibility to run 32-bit applications
- and supports the virtual 8086 mode (used for example by Windows' 386
- enhanced mode).
-
- The 387SX has all the features of the Intel 80387, including the capability
- for asynchronous operation of CPU and coprocessor (see Intel 387DX
- information, above). Due to the 16-bit data path between the CPU and the
- coprocessor, the 387SX is a bit slower than an 80387 operating at the
- same frequency. In addition, the 387SX is based on the core of the
- original 80387, which executes instructions slower than the second
- generation 387DX.
-
- The 387SX comes in a 68-pin PLCC (plastic leaded chip carrier) package
- and is available in 16 MHz and 20 MHz versions. (Coprocessors for faster
- 386SX systems based on the Am386SX CPU are available from IIT, Cyrix,
- and ULSI.) Power consumption for the 387SX at 16 MHz is max. 1250 mW
- (740 mW typical); for the 20 MHz version it is max. 1500 mW (1000 mW
- typical) [62].
-
-
- Intel 387SL
-
- This coprocessor is designed for use in systems that contain an Intel
- 386SL as the CPU. The 386SL is directly derived from the 386SX. It is a
- static CHMOS IV design with very low power requirements that is intended
- to be used in notebook and laptop computers. It features an integrated
- cache controller, a programmable memory controller, and hardware support
- for expanded memory according to the LIM EMS 4.0 standard. The 387SL,
- introduced in early 1992, has been designed to accompany the 386SL in
- machines with low power consumption and to replace the 387SX for this
- purpose. It features advanced power saving mechanisms. It is based on
- the 387DX core, rather than on the older and slower 80387 core (which is
- used by the 387SX).
-
-
- IIT 3C87
-
- This IIT chip was introduced in 1989, about the same time as the Cyrix
- 83D87. Both coprocessors are faster than Intel's 387DX coprocessor. The
- IIT 3C87 also provides extra functions not available on any other 387
- chip [38]. It has 24 user-accessible floating-point registers organized
- into three register banks. Three additional instructions (FSBP0, FSBP1,
- FSBP2) allow switching from one bank to another. (Transfers between
- registers in different banks are not supported, however, so this feature
- by itself is of limited usefulness. Also, there seems to be only one
- status register [containing the stack top pointer], so it has to be
- manually loaded and stored when switching between banks with a different
- number of registers in use [40]). The register bank's main purpose is to
- aid the fourth additional instruction the 3C87 has (F4X4), which does a
- full multiply of a 4x4 matrix by a 4x1 vector, an operation common in
- 3D-graphics applications [39]. The built-in matrix multiply speeds this
- operation up by a factor of 6 to 8 when compared to a programmed
- solution according to the manufacturer [38]. Tests show the speed-up to
- be indeed in this range [40]. I measured the F4X4 to execute in about
- 280 clock cycles, during which time it executes 16 multiplications and
- 12 additions. The built-in matrix multiply speeds up the matrix-by-
- vector multiply by a factor of 3 compared with a programmed solution
- according to IIT [39]. The results for my own TRNSFORM benchmark support
- this claim (see results below), showing a performance increase by a
- factor of about 2.5. This makes matrix multiplies on the IIT 3C87 nearly
- as fast as on an Intel 486 at the same clock frequency. As desirable as
- the F4X4 instruction may seem, however, there are very few applications
- that make use of it when an IIT coprocessor is detected at run time
- (among them Schroff Development's Silver Screen and Evolution
- Computing's Fast-CAD 3-D [25]).
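-
- For comparison, the "programmed solution" that F4X4 replaces is an
- ordinary product of a 4x4 matrix and a 4x1 vector. The Python sketch
- below shows the computation and counts the operations; it does not model
- the F4X4 instruction or its register bank layout, and the function name
- and sample data are chosen for this illustration only.
-
```python
def transform(matrix, vector):
    """Multiply a 4x4 matrix by a 4x1 vector, counting the operations."""
    mults = adds = 0
    result = []
    for row in matrix:
        # First product of the row needs no addition.
        acc = row[0] * vector[0]
        mults += 1
        # Remaining three products are each added to the running sum.
        for m, v in zip(row[1:], vector[1:]):
            acc += m * v
            mults += 1
            adds += 1
        result.append(acc)
    return result, mults, adds


# Translate the homogeneous point (1, 2, 3) by (10, 20, 30).
translate = [[1, 0, 0, 10],
             [0, 1, 0, 20],
             [0, 0, 1, 30],
             [0, 0, 0, 1]]
point = [1, 2, 3, 1]

print(transform(translate, point))   # ([11, 22, 33, 1], 16, 12)
```
-
- The counts match the 16 multiplications and 12 additions mentioned
- above.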
-
- These IIT-specific instructions also work correctly when using a Chips &
- Technologies 38600DX or a Cyrix 486DLC CPU, which are both marketed as
- faster replacements for the Intel 386DX CPU.
-
- Tests I ran with the IEEETEST program show that the 3C87 is not fully
- compatible with the IEEE-754 standard for floating-point arithmetic,
- although the manufacturer claims otherwise. It is indeed possible that
- the reported errors are due to personal interpretations of the standard
- by the program's author that have been incorporated into IEEETEST and
- that the standard also supports the different interpretation chosen by
- IIT. On the other hand, the IEEE test vectors incorporated into IEEETEST
- have become somewhat of an industry standard [66] and Intel's 387, 486,
- and RapidCAD chips pass the test without a single failure, so the fact
- that the IIT 3C87 fails some of the tests indicates that it is not fully
- compatible with the Intel 387 coprocessor. My tests also show that the
- IIT 3C87 does not support denormals for the double extended format. It
- is not entirely clear whether the IEEE standard mandates support for
- extended precision denormals, as the IEEE-754 document explicitly only
- mentions single and double-precision denormals. Missing support for
- denormals is not a critical issue for most applications, but there are
- some programs for which support of denormals is at the very least quite
- helpful [41]. In any case, failure of the 3C87 to support extended
- precision denormal numbers does represent an incompatibility with the
- Intel 387 and 486 chips.
-
- The 3C87 is implemented in an advanced CMOS process and has low power
- requirements, typically about 600 mW. Like the 387 'clones' from Cyrix
- and ULSI, the 3C87 does not support asynchronous operation of the CPU
- and the coprocessor, but always runs at the full speed of the CPU. It is
- available in 16, 20, 25, 33, and 40 MHz versions.
-
-
- IIT 3C87SX
-
- This is the version of the IIT 3C87 that is intended for use with
- Intel's 386SX or AMD's Am386SX CPU, and is functionally equivalent to
- the IIT 3C87. Due to the 16-bit data path between the CPU and the
- coprocessor in a 386SX-based system, coprocessor instructions will
- execute somewhat more slowly than on the 3C87. At present, the IIT
- 3C87SX is offered at speeds of 16, 20, 25, and 33 MHz. The IIT 3C87SX
- and the Cyrix 83S87 are the only 387SX-type math coprocessors that
- come in a 33 MHz version. The 3C87SX is packaged in a 68-pin PLCC.
-
-
- IIT 487DLC
-
- Reports in Internet NEWS seem to indicate that this chip can be
- used interchangeably with the IIT 3C87.
-
-
-
- Cyrix FasMath 83D87
-
- This chip was introduced in 1989, only shortly after the coprocessors
- from IIT. It has been found to be the fastest 387-compatible coprocessor
- in several benchmark comparisons [1,7,68,69]. It also came out as the
- fastest coprocessor in my own tests (see benchmark results below).
- Although the Cyrix 83D87 provides up to 50% more performance than the
- Intel 387DX in benchmark comparisons, the speed advantage over other
- 387-compatible coprocessors in real applications is usually much
- smaller, because coprocessor instructions represent only a small part of
- the total application code. For example, in a test using the program 3D-
- Studio, the Cyrix 83D87 was 6% faster than the Intel 387DX [1].
-
- Besides being the fastest 387 coprocessor, the 83D87 also offers the
- most accurate transcendental function results of all coprocessors
- tested (see test results below). The new "387+" version of the 83D87,
- available since November 1991, even surpasses the level of accuracy of
- the original 83D87 design. Note that the name 387+ is used in European
- distribution only. In other parts of the world, the new chip still goes
- by the name 83D87.
-
- Unlike Intel's coprocessors, which use the CORDIC [18,19] algorithm to
- compute the transcendental functions, Cyrix uses polynomial and rational
- approximations to the functions. In the past the CORDIC method has been
- popular since it requires only shifts and adds, which made it relatively
- easy to implement a reasonably fast algorithm. Recently, the cost for the
- implementation of fast floating-point hardware multipliers has dropped
- significantly (due to the availability of VLSI), making the use of
- polynomial and rational approximations superior to CORDIC for the
- generation of transcendental functions [61]. The Cyrix 83D87 uses a fast
- array multiplier, making its transcendental functions faster than those
- of any other 387 compatible coprocessor. It also uses 75 bits for the
- mantissa in intermediate calculations (as opposed to 68 bits on other
- coprocessors), making its transcendental functions more accurate than
- those of any other coprocessor or FPU (see results below).
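-
- The contrast between the two approaches can be sketched in a few lines
- of Python. This is only an illustration of the general techniques, not
- the algorithms actually implemented in any Intel or Cyrix chip: the
- iteration count, the use of plain Taylor coefficients, and the function
- names cordic_sin_cos and poly_sin are assumptions chosen for the example.
- The CORDIC routine multiplies by powers of two where hardware would
- simply shift.
-
```python
import math

N_ITER = 40  # assumed number of CORDIC iterations (roughly one bit each)

# Table of atan(2^-i) and the constant gain correction, both precomputed.
ATAN_TABLE = [math.atan(2.0 ** -i) for i in range(N_ITER)]
GAIN = 1.0
for i in range(N_ITER):
    GAIN /= math.sqrt(1.0 + 2.0 ** (-2 * i))


def cordic_sin_cos(angle):
    """Rotation-mode CORDIC for |angle| <= pi/2; returns (sin, cos)."""
    x, y, z = GAIN, 0.0, angle
    for i in range(N_ITER):
        d = 1.0 if z >= 0.0 else -1.0
        # In hardware, the multiplication by 2^-i is a right shift, so each
        # step needs only shifts and adds.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * ATAN_TABLE[i]
    return y, x


# Taylor coefficients of sin(x)/x in powers of x^2. A real design would
# more likely use a minimax polynomial or a rational function.
SIN_COEFFS = [(-1.0) ** k / math.factorial(2 * k + 1) for k in range(10)]


def poly_sin(angle):
    """Polynomial approximation of sin for |angle| <= pi/2 (Horner's rule)."""
    x2 = angle * angle
    acc = 0.0
    for c in reversed(SIN_COEFFS):
        acc = acc * x2 + c  # one multiply and one add per coefficient
    return angle * acc
```
-
- The CORDIC loop gains about one bit of accuracy per iteration, whereas
- the polynomial needs only one multiply-add per coefficient. This is why
- a fast hardware multiplier tips the balance towards the polynomial
- method.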
-
- The 83D87 (and its successor, the 387+) are the 387 'clones' with the
- highest degree of compatibility to the Intel 387DX. A few minor software
- and hardware incompatibilities have been documented by Cyrix [12]. The
- software differences are caused by some bugs present in the 387DX that
- Cyrix fixed in the 83D87. Unlike the Intel 387DX, the 83D87 (and all
- other 387-compatible chips as well) does not support asynchronous
- operation of CPU and coprocessor. There were also problems in the past
- with the CPU-coprocessor communications, causing the 83D87 to
- occasionally hang on some machines. The reason behind this was that
- Cyrix shaved off a wait state in the communication protocol, which
- caused a communications breakdown between the CPU and the 83D87 for some
- systems running at 25 MHz or faster. (One notable example of this
- behavior was the Intel 302 board.) Also there were problems with boards
- based on early revisions of the OPTI chipset. These problems are only
- rarely encountered with the current generation of 386 motherboards, and
- it is possible that they have been entirely eliminated in the 387+, the
- successor to the 83D87.
-
- To reduce power consumption, the 83D87 incorporates advanced power-saving
- features. Those portions of the coprocessor that are not needed are
- automatically shut down. If no coprocessor instructions are being
- executed, *all* parts except the bus interface unit are shut down [12].
- Maximal power consumption of the Cyrix 83D87 at 33 MHz is 1900 mW, while
- typical power consumption at this clock frequency is 500 mW [15].
-
-
- Cyrix EMC87
-
- This coprocessor is basically a special version of the Cyrix 83D87,
- introduced in 1990. In addition to the normal 387 operating mode, in
- which coprocessor-CPU communication is handled through reserved IO
- ports, it also offers a memory-mapped mode of operation similar to the
- operation principle of the Weitek Abacus. Like the Weitek chip, the
- EMC87 occupies a block of memory starting at physical address C0000000h
- (the Abacus occupies a memory block of 64 KB, while the EMC87 uses only
- 4 KB [77]). It can therefore only be accessed in the protected or
- virtual modes of the 386 CPU. DOS programs can access the EMC87 with the
- help of DOS extenders or memory managers like EMM386 which run in
- protected/virtual mode themselves. To implement the memory-mapped
- interface, the usual 80x87 architecture has been slightly expanded with
- three additional registers and eleven additional instructions that can
- only be used if the memory-mapped mode is enabled.
-
- Using this special mode of the EMC87 provides a significant speed
- advantage. The traditional 387 CPU-coprocessor interface via IO ports
- has an overhead of about 14-20 clock cycles. Since the Cyrix 83D87
- executes some operations like addition and multiplication in much less
- time, its performance is actually limited by the CPU-coprocessor
- interface. Since the memory-mapped mode has much less overhead, it
- allows all coprocessor instructions to be executed at full speed with no
- penalty.
-
- Originally, Cyrix claimed support for the fast memory-mapped mode of the
- EMC87 from a number of software vendors (including Borland and
- Microsoft). However, there are only very few applications that make use
- of it, among them Evolution Computing's FastCAD 3D, MicroWay Inc.'s NDP
- FORTRAN-386 compiler, Metaware's High-C compiler version 1.6 and newer,
- and Intusoft's Spice [63,73]. Part of the problem in supporting the
- memory-mapped mode is that the application must reserve one of the
- general purpose registers of the CPU to use memory-mapped mode
- instructions that access memory.
-
- (Note that the EMC87 is *not* compatible with Weitek's Abacus
- coprocessor. They both use the same CPU interface technique [memory
- mapping], but while the EMC87 uses the standard 387 instruction set, the
- Weitek Abacus coprocessors use an entirely different instruction set of
- their own.)
-
- Since the EMC87 also provides the standard 386/387 CPU interface via IO
- ports, it can be used just like any other 387-compatible coprocessor and
- delivers the same performance as the Cyrix 83D87 in this mode. The EMC87
- even allows mixed use of memory-mapped and traditional instructions in
- the same code. Cyrix has also implemented some additional instructions
- in the EMC87 that are also available in the 387-compatible mode:
- FRICHOP, FRINT2, and FRINEAR. These instructions enable rounding to
- integer without setting the rounding mode by manipulating the
- coprocessor control word, and are intended to make life easier for
- compiler writers.
-
- In a test, the EMC87 at 33 MHz ran the single-precision Whetstone
- benchmark at 7608 kWhetstones/sec, while the Cyrix 83D87 at 33 MHz had a
- speed of only 5049 kWhetstones/sec, an increase of about 50.7% [63]. In
- another test, the EMC87 ran a fractal computation at twice the speed of
- the Cyrix 83D87 and 2.6 times as fast as an Intel 387DX [64]. A third
- test found the EMC87's overall performance to be 20% higher than the
- performance of the Cyrix 83D87 [65].
-
- The Cyrix FasMath EMC87 has also been marketed as Cyrix AutoMATH; the
- two chips are identical. Unlike the Cyrix 83D87, which fits into the 68-
- pin 387 coprocessor socket, the EMC87 comes in a 121-pin PGA and
- requires the 121-pin EMC (Extended Math Coprocessor) socket. Note that
- not all boards have such a socket (a notable exception being IBM's
- PS/2s, for example). The EMC87 is available in 25 and 33 MHz versions.
- Maximum power consumption at 33 MHz is 2000 mW.
-
- Cyrix phased out the EMC87 in 1992 and it is no longer available from
- chip dealers.
-
-
- Cyrix FasMath 387+
-
- This chip is the second-generation successor to the Cyrix 83D87. (The
- name "387+" is only used for European distribution; in other parts of
- the world, it goes by the original 83D87 designation.) According to a
- source within Cyrix [73], the 387+ was designed to make a smaller (and
- thus cheaper to manufacture) coprocessor chip that could also be pushed
- to higher frequencies than the original chip: the 387+ is available in
- versions of up to 40 MHz, whereas the original 83D87 could go no faster
- than 33 MHz.
-
- The Cyrix 387+ is ideally suited to be used with Cyrix's 486DLC CPU,
- which is a 486SX compatible replacement chip for the Intel 386DX.
- Indeed Cyrix sells upgrade kits consisting of a 486DLC CPU and a
- Cyrix 387+.
-
- In my tests, I found the Cyrix 387+ to be about 5 to 10 percent
- *slower* than the Cyrix 83D87 overall. Some instructions fare worse: the
- square root (FSQRT) now runs at only half the speed at which it ran in
- the 83D87, and most transcendental functions show about a 40% drop in
- performance compared to their 83D87 averages (see performance results,
- below). However, I did find the transcendental functions on the 387+ to
- be a bit *more* accurate than those implemented in the 83D87. The new
- design uses a slower hardware multiplier that needs six clock cycles to
- multiply the floating-point mantissa of an internal precision number,
- while the multiplier in the 83D87 takes only four clocks to accomplish
- the same task. Since the transcendental functions in Cyrix math
- coprocessors are generated by polynomial and rational approximations,
- this slows them down significantly.
-
- The divide/square root logic has also been changed from the 83D87
- design. The original design used an algorithm that could generate both
- the quotient and square root, so the execution times for these
- instructions were nearly identical. The algorithm chosen for the
- division in the 387+ doesn't allow the square root to be taken so
- easily, so it takes nearly twice as long.
-
- In the 387+, the available argument range for the FYL2XP1 instruction
- has been extended, from the usual range -1+sqrt(2)/2..sqrt(2)/2 that is
- found on all 80x87 coprocessors, to include all floating-point numbers.
- Also, four additional instructions have been implemented: FRICHOP
- (opcode DD FC), FRINT2 (opcode DB FC), FRINEAR (opcode DF FC), and FTSTP
- (opcode D9 E6).
-
-
- Cyrix FasMath 83S87
-
- The 83S87 is the SX version of the Cyrix 83D87. Just as the 83D87 is the
- fastest 387-compatible coprocessor, the Cyrix 83S87 is the fastest of
- the 387SX compatible coprocessors [1], as well as providing the most
- accurate transcendental functions. 83S87 chips manufactured after 1991
- use the internals of the Cyrix 387+, the successor to the original 83D87
- [73] (above). The Cyrix 83S87 is ideally suited to be used with the
- Cyrix Cx486SLC CPU, a 486SX compatible CPU which is a replacement chip
- for the Intel 386SX CPU.
-
- The 83S87 is packaged in a 68-pin PLCC and is available in 16, 20, 25,
- and 33 MHz versions. Due to the advanced power saving features of the
- Cyrix coprocessor, the typical power consumption of the 20 MHz version
- is only about 350 mW [67], while maximum power dissipation is 1.6 W [80].
-
-
- ULSI Math*Co 83C87
-
- The ULSI 83C87 is an 80387-compatible coprocessor first introduced in
- early 1991, well after the IIT 3C87 and Cyrix 83D87 appeared. Like other
- 387 clones, it is somewhat faster than the Intel 387DX, particularly in
- its basic arithmetic functions. The transcendental functions, however,
- show only a slight speed improvement over the Intel 387DX (see benchmark
- results below).
-
- In my tests, the ULSI had the least accurate transcendental functions
- of all tested coprocessors. However, the maximum relative error is still
- within the limits set by Intel, so this is probably not an important
- issue for all but a very few applications. The ULSI 83C87 shows some
- minor flaws in the tests for IEEE 754 compatibility, but this, too, is
- probably unimportant under typical operating conditions. ULSI claims
- that the program IEEETEST, which was used to test for IEEE
- compatibility, contains many personal interpretations of the IEEE
- standard by the program's author and states that there is no ANSI-
- certified IEEE-754 compliance test. While this may be true, it is
- also a fact that the IEEE test vectors used in IEEETEST are a de facto
- industry standard, and that Intel's 387, 486, and RapidCAD chips pass it
- without a single failure, as do the coprocessors from Cyrix. Since the
- ULSI Math*Co 83C87 fails some of the tests, it is certainly less than
- 100% compatible with Intel's chips, although this will likely make
- little or no difference in typical operating conditions. (It is
- interesting to note that a ULSI 83S87 manufactured in 92/17 showed
- fewer errors in the IEEETEST test run [74] than the ULSI 83C87,
- manufactured in 91/48, I used in my original test. This indicates that
- ULSI might have applied some quick fixes to newer revisions of their
- math coprocessors.)
-
- The ULSI 83C87 fails to be compatible with IEEE-754 in that it does
- not implement the "precision control" feature. While all the internal
- operations of 80x87 coprocessors are usually performed with the maximum
- precision available (double-extended precision with 64 mantissa bits),
- the 80x87 architecture also offers the possibility to force lower
- precision to be used for the basic arithmetic functions (add, subtract,
- multiply, divide, and square root). This feature is required by IEEE-754
- for all coprocessors that can not store results *directly* to a single
- or double-precision location. Since 80x87 coprocessors lack this storage
- capability, they all implement precision control to provide correctly
- rounded single- and double-precision results according to the floating-
- point standard - except the ULSI chips. For programs that make use of
- precision control (e.g., Interactive UNIX), correct implementation of
- the feature may be essential for correct arithmetic results.
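-
- For reference, precision control is selected by a two-bit field in the
- 80x87 control word. The sketch below shows only the bit manipulation
- involved, written in Python; actually loading the value into a
- coprocessor requires the FLDCW instruction, which is not shown. The
- helper names set_precision and get_precision and the constant names are
- assumptions made for this example. The field position and encodings
- follow Intel's description of the control word.
-
```python
PC_SHIFT = 8                 # precision control occupies bits 8-9
PC_MASK = 0b11 << PC_SHIFT

PC_SINGLE = 0b00             # round results to a 24-bit mantissa
PC_DOUBLE = 0b10             # round results to a 53-bit mantissa
PC_EXTENDED = 0b11           # round results to a 64-bit mantissa

DEFAULT_CW = 0x037F          # control word value after FINIT


def set_precision(control_word, pc):
    """Return control_word with its precision-control field set to pc."""
    return (control_word & ~PC_MASK) | (pc << PC_SHIFT)


def get_precision(control_word):
    """Extract the precision-control field from control_word."""
    return (control_word & PC_MASK) >> PC_SHIFT
```
-
- For example, set_precision(DEFAULT_CW, PC_DOUBLE) yields 027Fh. A chip
- that ignores this field keeps rounding to 64 mantissa bits regardless of
- the value loaded.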
-
- It seems to be confirmed by numerous postings on the Internet that
- using a ULSI math coprocessor with protected mode operating systems
- will result in system lockup once tasks using the math coprocessor are
- run. This seems to be the result of a bug in the FSAVE and FRSTOR
- instructions in 32-bit protected mode. These instructions are used to
- save and restore the math coprocessor state for the purpose of switching
- coprocessor contents between two tasks. OS/2 and Linux are two operating
- systems that have been explicitly mentioned as having locked up if a
- ULSI math coprocessor is used, but run fine with other math coprocessors.
- ULSI is supposedly aware of the problem. So far, no fixes seem to have
- been introduced in newer ULSI math coprocessors to remedy the problem.
- Therefore it seems unlikely that ULSI will eventually introduce such
- bug fixes.
-
- Like other non-Intel 387 compatibles, the 83C87 does not support
- asynchronous operation of the CPU and the coprocessor. This means that
- the 83C87 always runs at the full speed of the CPU. It is available in
- 20, 25, 33, and 40 MHz versions. The ULSI is produced in low power CMOS;
- power consumption at 20 MHz is max. 800 mW (400 mW typical), at 25 MHz
- it is max. 1000 mW (500 mW typical), at 33 MHz it is max. 1250 mW (625
- mW typical), and at 40 MHz it is max. 1500 mW (750 mW typical) [58]. The
- 83C87 is packaged in a 68-pin ceramic PGA.
-
- ULSI coprocessors come with a lifetime warranty. ULSI Systems, Inc.,
- will replace the coprocessor up to three times free of charge should it
- ever fail to function properly.
-
-
- ULSI Math*Co 83S87
-
- This chip is the SX version of the ULSI 83C87, for use in systems with
- an Intel 386SX or an AMD Am386SX CPU. It is functionally equivalent to
- the 83C87. To aid low-power laptop designs, the ULSI 83S87 features an
- advanced power saving design with a sleep mode and a standby mode with
- only minimal power requirements. Power consumption under normal
- operating conditions (dynamic mode) is max. 400 mW at 16 MHz (300 mW
- typical), max. 450 mW at 20 MHz (350 mW typical), and max. 500 mW at 25
- MHz (400 mW typical) [58]. The ULSI 83S87 is packaged in a 68-pin PLCC.
-
-
- ULSI DX/DLC
-
- This math coprocessor seems to be a slightly enhanced version of the
- original ULSI 83C87. Some incompatibilities with respect to the IEEE
- 754 standard for floating-point arithmetic have been removed, but the
- chip still doesn't pass IEEETEST without mismatches. Some of the
- transcendental functions have been sped up somewhat. Other than that,
- I couldn't find any significant changes.
-
-
- C&T SuperMATH 38700DX
-
- Produced by Chips&Technologies, this was the latest entry into the 387-
- compatible marketplace. Originally announced in October, 1991, it had
- apparently not been available to end-users before the third quarter of
- 1992. The product was discontinued after only a few months, since C&T
- stopped all work on their CPU and coprocessor development. My tests
- show that its compatibility with Intel products is very good, even
- for the more arcane features of the 387DX, and comparable to the
- coprocessors from Cyrix. Like these chips, it passes the IEEETEST
- program without a single failure. It passes, of course, all tests in
- Chips&Technologies' own compatibility test program, SMDIAG. However,
- some of the tests (the transcendental functions) in this program are
- selected in such a way that the C&T 38700 passes while the Cyrix 83D87
- or Intel RapidCAD fail, so they are not very useful. (There is also a
- 'bug' in the test for FSCALE that hides a true bug in the C&T 38700.)
- My tests show the accuracy of the transcendental functions on the C&T
- 38700DX varies. Overall, accuracy of the transcendentals is slightly
- better than on the Intel 387DX.
-
- In my own speed tests [see below] and those reported in [1], the C&T
- 38700DX showed performance at about 90-100% the level of the Cyrix
- 83D87, which is the 387 clone with the highest performance. For
- floating-point-intensive benchmarks, the C&T 38700DX provides up to 50%
- more computational performance than the Intel 387DX. However, as with
- all other 387 compatible coprocessors, the speed advantage over the
- Intel 387DX is far less significant in real applications.
-
- The SuperMATH 38700DX is implemented in 1.2 micron CMOS with on-chip
- power management, which makes for low power consumption. The 38700DX is
- packaged in a 68-pin ceramic PGA (pin grid array) and is available in
- speeds of 16, 20, 25, 33, and 40 MHz.
-
-
- C&T 38700SX
-
- This chip is the SX version of the 38700DX and compatible with the Intel
- 387SX. It provides performance comparable to a Cyrix 83S87 [1], the
- 387SX clone with the highest performance. Compatibility with the Intel
- 387SX is very good and on par with the high degree of compatibility
- found in the Cyrix 83S87.
-
- The 38700SX has low power consumption. It is packaged in a 68-pin PLCC
- (plastic leaded chip carrier) and available in speeds of 16, 20, and 25
- MHz.
-
- This chip is no longer available, since C&T stopped all work on their
- 386/387 compatible chips in early 1993.
-
-
- Intel RapidCAD
-
- The RapidCAD is not a coprocessor, strictly speaking, although it is
- marketed as one. Rather, it is a full replacement for an 80386 CPU:
- basically, an Intel 486DX CPU chip without the internal cache and with a
- standard 386 pinout. RapidCAD is delivered as a set of two chips.
- RapidCAD-1 goes into the 386 socket and contains the CPU and FPU.
- RapidCAD-2 goes into the coprocessor (387) socket and contains a simple
- PAL whose only purpose is to generate the FERR signal normally generated
- by a coprocessor. (This is needed by the motherboard circuitry to provide
- 287 compatible coprocessor exception handling in 386/387 systems.) The
- RapidCAD instruction set is compatible with the 386, so it doesn't have
- any newer, 486-specific instructions like BSWAP. However, since the
- RapidCAD CPU core is very similar to the 80486 CPU core, most of the
- register-to-register instructions execute in the same number of clock
- cycles as on the 486.
-
- RapidCAD's use of the standard 386 bus interface causes instructions
- that access memory to execute at about the same speed as on the 386. The
- integer performance on the RapidCAD is definitely limited by the low
- memory bandwidth provided by this interface (2 clock cycles per bus
- cycle) and the lack of an internal cache. CPU instructions often execute
- faster than they can be fetched from memory, even with a big and fast
- external cache. Therefore, the integer performance of the RapidCAD
- exceeds that of a 386 by *at most* 35%. This value was derived by
- running some programs that use mostly register-to-register operations
- and few memory accesses, and is supported by the SPEC ratings that Intel
- reports for the 386-33 and the RapidCAD-33: while the 386-33 has a
- SPECint of 6.4, the RapidCAD has a SPECint of 7.3 [28], a 14% increase.
- (Note that these tests used the old [1989] SPEC benchmark suite.)
-
- While CPU and integer instructions often execute in one clock cycle on
- the RapidCAD, floating-point operations always take more than seven
- clock cycles. They are therefore rarely slowed down by the low-bandwidth
- 386 bus interface. My tests show a 70%-100% performance increase for
- floating-point intensive benchmarks over a 386-based system using the
- Intel 387DX math coprocessor. This is consistent with the SPECfp rating
- reported by Intel. The 386/387 at 33 MHz is rated at 3.3 SPECfp, while
- the RapidCAD is rated at 6.1 SPECfp at the same frequency, an 85%
- increase. This means that a system that uses the RapidCAD is faster than
- *any* 386/387 combination, regardless of the type of 387 used, whether
- an Intel 387DX or a faster 387 clone. The diagnostic disk for the
- RapidCAD also gives some application performance data for the RapidCAD
- compared to the Intel 387DX:
-
- Application             Time w/ 387DX   Time w/ RapidCAD   Speedup
-
- AutoCAD 11                     52 sec             32 sec       63%
- AutoShade/Renderman           180 sec            108 sec       67%
- Mathematica (Windows)         139 sec            103 sec       35%
- SPSS/PC+ 4.01                  17 sec             14 sec       21%
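-
- The speedup column agrees, to within rounding, with computing
- (time with 387DX / time with RapidCAD - 1) x 100. This formula is
- inferred from the figures rather than stated on the diagnostic disk. A
- short Python check follows, with the function name speedup_percent
- chosen purely for illustration:
-
```python
def speedup_percent(time_old, time_new):
    """Percentage speedup when run time drops from time_old to time_new."""
    return (time_old / time_new - 1.0) * 100.0


# Run times in seconds from the table above: (387DX, RapidCAD).
timings = {
    "AutoCAD 11": (52, 32),
    "AutoShade/Renderman": (180, 108),
    "Mathematica (Windows)": (139, 103),
    "SPSS/PC+ 4.01": (17, 14),
}

for name, (t_387, t_rapidcad) in timings.items():
    print(f"{name:24s}{speedup_percent(t_387, t_rapidcad):6.1f}%")
```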
-
- RapidCAD is available in 25 MHz and 33 MHz versions. It is distributed
- through different channels than the other Intel math coprocessors, and I
- have therefore been unable to obtain a data sheet for it. [78] gives the
- typical power consumption of the 33 MHz RapidCAD as 3500 mW, which is
- the same as for the 33 MHz 486DX. The RapidCAD-1 chip gets quite hot
- when operating. Therefore, I recommend extra cooling for it (see the
- paragraph below on the 486 for details). The RapidCAD-1 is packaged in a
- 132-pin PGA, just like the 80386, and the RapidCAD-2 is packaged in a
- 68-pin PGA like an 80387 coprocessor.
-
-
- Intel 486DX
-
- The Intel 486DX is, of course, not solely a coprocessor. This chip,
- first introduced by Intel in 1989, functionally combines the CPU (a
- heavily-pipelined implementation of the 386 architecture) with an
- enhanced 387 (the chip's floating-point unit, FPU) and 8 KB of unified
- on-chip code/data cache. (This description is necessarily simplified;
- for a detailed hardware description, see [52].) The 486DX offers about
- two to three times the integer performance of a 386 at the same clock
- frequency, while floating-point performance is about three to four times
- as high as the Intel 387DX at the same clock rate [29]. Since the FPU is
- on the same chip as the CPU, the considerable communication overhead
- between CPU and coprocessor in a 386/387 system is omitted, letting FPU
- instructions run at the full speed permitted by the implementation. The
- FPU also takes advantage of the on-chip cache and the highly pipelined
- execution unit. The concurrent execution of CPU and coprocessor
- instructions typical for 80x86/80x87 systems is still in existence on
- the 486, but some FPU instructions like FSIN have nearly no concurrency
- with CPU instructions, indicating that they make heavy use of both CPU
- and FPU resources [53, 1].
-
- Besides its higher performance, the 486 FPU provides more accurate
- transcendental functions than the 387DX coprocessor, according to my
- tests (see below). To achieve better interrupt latency, FPU instructions
- with long execution times have been made abortable if an interrupt
- occurs during their execution.
-
- Due to the considerable amount of heat produced by these chips, and
- taking into consideration the slow air flow provided by the fan in
- garden-variety PC tower cases, I recommend an extra fan directly above
- the CPU for safer operation. If you measure the surface temperature of
- a 486DX after some time of operation in a normal tower case without
- extra cooling, you may well come up with something like 80-90 degrees
- Celsius (that is 175-195 degrees Fahrenheit for those not familiar with
- metric units) [54,55]. You don't need the well known (and expensive)
- IceCap[tm] to effectively cool your CPU; a simple fan mounted directly
- above the CPU can bring the temperature of the chip down to about 50-60
- degrees Celsius (120-140 degrees Fahrenheit), depending on the room
- temperature and the temperature within the PC case (which depends on the
- total power dissipation of all the components and the cooling provided
- by the fan in the system's power supply). According to a simple rule
- known as Arrhenius' Law, lowering the temperature by 10 degrees Celsius
- slows down chemical reactions by a factor of two, so lowering the
- temperature of your CPU by 30 degrees should prolong the life of the
- device by a factor of eight, due to the slower ageing process. If you
- are reluctant to add a fan to your system because of the additional
- noise, settle for a low-noise fan like those available from the German
- manufacturer Pabst (this is not meant to be an advertisement; I am just
- the happy owner of such a fan, and have no other connections to the
- firm).
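-
- The rule of thumb quoted above amounts to a lifetime factor of
- 2^(dT/10) for a temperature drop of dT degrees Celsius. A minimal Python
- sketch of that arithmetic follows, together with the unit conversion
- used for the temperatures above. The function names are hypothetical,
- and the rule itself is only a rough approximation, not a precise
- reliability model.
-
```python
def lifetime_factor(delta_t_celsius, doubling_step=10.0):
    """Estimated lifetime multiplier for a temperature drop of delta_t_celsius.

    Assumes the rule of thumb that every doubling_step degrees Celsius of
    cooling halves the rate of the ageing processes.
    """
    return 2.0 ** (delta_t_celsius / doubling_step)


def celsius_to_fahrenheit(t_celsius):
    """Convert a temperature from degrees Celsius to degrees Fahrenheit."""
    return t_celsius * 9.0 / 5.0 + 32.0
```
-
- With these definitions, lifetime_factor(30) evaluates to 8.0, matching
- the factor of eight given in the text.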
-
- The 486DX comes in a 168 pin ceramic PGA (pin grid array). It is
- available in 25 MHz and 33 MHz versions. Since the end of 1991, a 50 MHz
- version has also been available, manufactured in a CHMOS V process (the
- 25 MHz and 33 MHz are produced using the CHMOS IV process). Maximum
- power consumption is 3500 mW for the 25 MHz 486 (2600 mW typical), 4500
- mW for the 33 MHz version (3500 mW typical), and 5000 mW (3875 mW
- typical) for the 50 MHz chip.
-
-
- Intel 486DX2
-
- The 486DX2 represents the latest generation of Intel CPUs. The "DX2"
- suffix (instead of simply DX) is meant to be an indicator that these are
- clock-doubled versions of the basic CPU. A normal 486DX operates at the
- frequency provided by the incoming clock signal. A 486DX2 instead
- generates a new clock signal from the incoming clock by means of a PLL
- (phase locked loop). In the DX2, this clock signal has twice the
- frequency of the incoming clock, hence the name clock-doubler. All
- internal parts of the 486DX2 (cache, CPU core, and FPU) run at this
- higher frequency; only the bus interface runs at the normal (undoubled)
- speed. Using this technique, an Intel 486DX2-50 can run on an unmodified
- motherboard designed for 25 MHz operation. Since motherboards which run
- at 50 MHz are much harder to design and build than those for 25 MHz,
- this makes a 486DX2-50 system cheaper than an 'equivalent' 486DX-50
- system.
-
- For all operations that don't access off-chip resources (e.g., register
- operations), a 486DX2-50 provides exactly the same performance as a
- 486DX-50, and twice the performance of a 486DX-25. However, since the
- main memory in a 486DX2-50 system still operates at 25 MHz, all
- instructions involving memory accesses are potentially slower than in a
- 486DX-50 system, whose memory also (presumably) runs at 50 MHz. The
- internal cache of the 486 helps this problem a bit, but overall
- performance of a 486DX2-50 is still lower than that of a 486DX-50.
- Intel's documentation [32] shows this drop to be quite small, although
- it is highly dependent upon the particular application.
-
- The truly wonderful thing about the 486DX2 is that it allows easy
- upgrading of 25 and 33 MHz 486 systems, since the 486DX2 is completely
- pin-compatible with the 486DX: you need only take out the 486DX and plug
- in the new 486DX2. Note that power consumption of the 486DX2-50 equals
- that of the 486DX-50 (4000 mW typical, 4750 mW max.), and that the
- 486DX2-66 exceeds this by about 25% (4875 mW typical, 6000 mW max.).
- These chips get *really* hot in a standard PC case with no extra
- cooling, even if they come with an attached heat sink by default. (See
- the discussion above for more detailed information on this problem and
- possible solutions.)
-
-
- Intel 487SX
-
- The 487SX is the math coprocessor intended for use in 486SX systems. The
- 486SX is basically a 486DX without the floating-point unit (FPU) [48,
- 50]. (Originally Intel sold 486DXs with a defective FPU as 486SXs but it
- has now completely removed the FPU part from the 486SX mask for mass
- production.) The introduction of the 486SX in 1991 has been viewed by
- many as a marketing 'trick' by Intel to take market share from the 386
- based systems once AMD became successful with their Am386. (AMD has
- taken as much as 40% of the 386 market due to some superior features
- such as higher clock frequency, lower power consumption, fully static
- design, and availability of a 3V version). A 486SX at 20 MHz delivers
- a bit less integer performance than a 40 MHz Am386.
-
- To add floating-point capabilities to a 486SX based system, it would
- seem to be easiest to swap the 486SX for a 486DX, which includes the FPU
- on-chip. However, Intel has prevented this easy solution by giving the
- 486SX a slightly different pinout [48, 51]. Since only three pins are
- assigned differently, clever board manufacturers have come out with
- boards that accept anything from a 486SX-20 to a 486DX2-50 in their CPU
- socket and by doing so provide a clean upgrade path. A set of three
- jumpers ensures correct signal assignment to the changed pins for either
- CPU type. To upgrade 486SX systems without this feature, you are forced
- to buy a 487SX and install it in the "Performance Upgrade Socket"
- (present in most systems).
-
- Once the 487SX was available, it was quickly found out that it is just a
- normal 486DX with a slightly different pinout [49]. Technically
- speaking, the solution Intel chose was the only practical way to provide
- a 486SX system with the high level of floating-point performance the
- 486DX offers. The CPU and FPU must be on the same chip; otherwise, the
- FPU cannot make use of the CPU's internal cache and there would be
- considerable overhead in CPU-FPU communication (similar to a 386/387
- system), nullifying most of the arithmetic speedups over the 387. That
- the 486SX, 487SX, and 486DX are *not* pin-compatible seems to be purely
- for marketing reasons.
-
- To upgrade a 486SX based system, Intel also offers the OverDrive chip,
- which is just the same as a 487SX with internal clock doubling. It also
- goes into the motherboard's "Performance Upgrade Socket". The OverDrive
- roughly doubles the performance of a 486SX/487SX based system. (For an
- explanation of clock doubling, see the description of the Intel 486DX2
- above.)
-
- Inserting the 487SX effectively shuts down the 486SX in the 486SX/487SX
- system, so the 486SX could be removed once the 487SX is installed. Since
- the shut down is logical, not electrical, the 486SX still uses power if
- used with the 487SX, although it is inoperative. As with the 486SX,
- the 487SX is currently available in 20 MHz and 25 MHz versions. At 20
- MHz, the 487SX has a power consumption of max. 4000 mW (3250 mW
- typical). It is available in a 169 pin ceramic PGA (pin grid array).
-
-
- Weitek 1167
-
- This math coprocessor was the predecessor of the Weitek Abacus 3167. It
- was actually a small printed circuit board with three chips mounted on
- it. In contrast to the Weitek 3167, the 1167 did not have a square root
- instruction; instead, the square root function was computed by means of
- a subroutine in the Weitek transcendental function library. However, the
- 1167 did have a mode in which it supported denormal numbers. (The Weitek
- 3167 and 4167 only implement the 'fast' mode, in which denormals are not
- supported.) Overall performance of the 1167 is slightly less than that
- of the Weitek 3167.
-
-
- Weitek 3167
-
- The 3167 was introduced by Weitek in 1989 and provided the fastest
- floating-point performance possible on a 386 based system at that time.
- The 3167 is not a real coprocessor, strictly speaking, but rather a
- memory-mapped peripheral device. The architecture of the 3167 was
- optimized for speed wherever possible. Besides using the faster memory
- mapped interface to the CPU (the 80x87 uses IO-ports), it does not
- support many of the features of the 80x87 coprocessors, allowing all of
- the chip's resources to be concentrated on the fast execution of the
- basic arithmetic operations. (For a more detailed description of the
- Weitek 3167, see the first chapter of this document.)
-
- In benchmark comparisons, the Weitek 3167 provided up to 2.5 times the
- performance of an Intel 387DX coprocessor. For example, on a 33 MHz 3167
- the Whetstone benchmark performed at 7574 kWhetstones/sec compared with
- the 3743 kWhetstones/s for the Intel 387DX. (Note, however, that these
- are single-precision results and that the Weitek 3167's performance
- would drop to about half the stated rate for double-precision, while the
- value for the Intel 387DX would change very little.) In any case, before
- the advent of the Intel RapidCAD, the Weitek 3167 usually outperformed
- all 387-compatible coprocessors, even for double-precision operations
- [63,65,69]. For typical applications, the advantage of the Weitek 3167
- over the 387 clones is much smaller. In a benchmark test using
- AutoDesk's 3D-Studio the Weitek 3167 performed at 123% of the Intel
- 387DX's performance compared with 106% for the Cyrix FasMath 83D87 and
- 118% for the Intel RapidCAD.
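-
- The arithmetic behind the Whetstone comparison above can be sketched in a
- few lines of Python. The two rates are the ones quoted in the text; treating
- "about half" as exactly one half, and the 387DX rate as unchanged for
- double-precision, are simplifying assumptions made only for illustration.

```python
# Whetstone figures quoted in the text (single precision, 33 MHz).
weitek_3167_single = 7574   # kWhetstones/sec
intel_387dx = 3743          # kWhetstones/sec

# Speedup of the 3167 over the 387DX in single precision.
single_ratio = weitek_3167_single / intel_387dx

# Assumption: the 3167 runs at exactly half this rate in double precision,
# and the 387DX rate does not change at all.
weitek_3167_double_est = weitek_3167_single / 2
double_ratio_est = weitek_3167_double_est / intel_387dx

print(f"single precision: {single_ratio:.2f}x")
print(f"double precision (estimate): {double_ratio_est:.2f}x")
```

- Under these assumptions the single-precision advantage is about a factor of
- two, and the estimated double-precision advantage nearly disappears for this
- particular benchmark.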
-
- The Weitek Abacus 3167 is packaged in a 121-pin PGA that fits into an
- EMC socket (provided in most 386-based systems). It does *not* fit into
- the normal 68-pin PGA socket intended for a 387 coprocessor.
-
- To get the best of both worlds, one might want to use a Weitek 3167 and
- a 387 compatible coprocessor in the same system. These coprocessors can
- coexist in the same system without problems; however, most 386-based
- systems contain only one coprocessor socket, usually of the EMC
- (extended math coprocessor) type. Thus, you can install either a 387
- coprocessor or a Weitek 3167, but not both at the same time. There *are*
- small daughter boards available that plug into the EMC socket and
- provide two sockets, an EMC and a standard coprocessor socket.
-
- At 25 MHz, the Weitek 3167 has a power consumption of max. 1750 mW. At
- 33 MHz, max. power consumption is 2250 mW.
-
-
- Weitek 4167
-
- The 4167 is a memory-mapped coprocessor that has the same architecture
- as the 3167; it is designed to provide 486-based systems with the
- highest floating-point performance available. It executes coprocessor
- instructions at three to four times the speed of the Weitek 3167.
- Although it is up to 80% faster than the Intel 486 in some benchmarks
- [1,69], the performance advantage for real applications is probably more
- like 10%. The introduction of the 486DX2 processors has more or less
- obliterated the need for a Weitek 4167, since the DX2 CPUs provide the
- same performance as the Weitek, as well as the additional features the
- 80x87 architecture has that the Weitek does not.
-
- The Weitek 4167 is packaged in a 142-pin PGA package that is only
- slightly smaller than the 486's package. At 25 MHz, it has a max. power
- consumption of 2500 mW [32].
-
-
-
- ======================================
- Finding out which coprocessor you have
- ======================================
-
- If you are interested in programming techniques which allow the detection and
- differentiation of the coprocessors described above, I refer you to my
- COMPTEST program. COMPTEST reliably detects the type and clock frequency of
- most CPUs and coprocessors installed in your machine. The current version is
- CTEST259.ZIP, with future versions to be called CTEST260, CTEST261 and so on.
- COMPTEST can correctly identify all of the coprocessors described above, with
- the exception of the Weitek chips, for which the detection mechanism is not
- that reliable.
-
- COMPTEST is in the public domain and comes with complete source code. It is
- available via anonymous ftp from garbo.uwasa.fi and additional ftp sites that
- mirror garbo.
-
-
-
- ================================================
- Current coprocessor prices and purchasing advice
- ================================================
-
- Due to mid-1992 price slashing by Cyrix (and subsequently, Intel) for 387
- coprocessors, prices have dropped significantly for all 287 and 387
- compatibles, with hardly any price difference between manufacturers. 387DX
- compatible coprocessors typically sell for ~US$ 70 for all speeds except for
- 40 MHz versions, which are typically ~US$ 80. 387SX compatible coprocessors
- sell for ~US$ 65, regardless of speed, with the exception of the 33 MHz
- versions, which are ~US$ 70. The Intel 287XL sells for ~US$ 60, while the
- IIT 2C87 and Cyrix 82S87 each sell for about US$ 50. 8087s may be more
- expensive, the price of an 8087-10 being ~US$ 85. The Intel RapidCAD sells
- for about US$ 240 now. The Weitek Abacus 3167-33 is being offered for US$ 170
- and the 4167-33 is being offered for US$ 300. The Intel 486SX OverDrive is
- available for ~US$ 450 for the 33 MHz version, while the Intel 486DX2-66 costs
- ~US$ 400. This price information reflects the price situation as of 05-01-94;
- prices can be expected to drop slightly in the near future. Also, prices can
- vary locally, so take the above as a rough idea of what to expect if you go
- out and buy a math coprocessor.
-
-
- Which coprocessor should you buy?
- ---------------------------------
- Several computer magazines have published application-level performance
- comparisons for various 387 coprocessors and Weitek's ABACUS 3167 and 4167
- chips [1,25,68,70]. Applications tested included AutoCAD R11, RenderStar,
- Quattro Pro, Lotus 1-2-3, and AutoDesk's 3D-Studio. For most tests,
- performance improvements for the 387 clones over Intel's 387DX were small to
- marginal, the clones running the applications no more than 5-15% faster than
- the Intel 387DX. In the test of 3D-Studio, one of the few programs that
- directly supports the Weitek Abacus, the Weitek 3167 improved performance by
- 23% over an Intel 387DX and the 4167 improved performance by 10% over the
- 486DX [1].
-
- If you have a high demand for floating-point performance, consider buying
- a system based on the 486DX and 486DX2 CPUs from Intel, AMD, or Cyrix (or
- even an Intel Pentium machine if you have enough money to spend), rather
- than a motherboard based on the Intel 386DX, AMD 386DX, Cyrix 486DLC,
- TI 486SXL, IBM 486SLC2, or IBM Blue Lightning with an additional math
- coprocessor. With a math coprocessor that is external to the CPU, there is
- a lot of communication overhead which limits floating-point performance,
- even if the CPU is clock-doubled or clock-tripled and the math coprocessor
- is also clock-doubled. So while the integer performance of such systems
- reaches 486DX levels in many cases, the floating-point performance is still
- significantly below 486DX performance. Currently, price/performance arguments
- do not apply here, though. An AMD 386DX/40 MHz or Cyrix 486DLC based ISA
- motherboard including the math coprocessor currently sells for ~US$ 180.
- A 486DX-40 MHz VLB motherboard based on CPUs from AMD or Cyrix sells for
- US$ 350. So with regard to floating-point performance, one gets twice the
- performance for twice the price.
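-
- The "twice the performance for twice the price" estimate can be checked with
- a short calculation. The prices are the ones quoted above. The text does not
- say which benchmark the performance claim rests on, so the 40 MHz Linpack
- results from the tables later in this document (Cyrix 486DLC with cache on
- plus Cyrix 387+, and Intel 486DX) are used here as stand-ins; that choice is
- an assumption.

```python
# Motherboard prices quoted in the text (US$).
price_486dlc_with_fpu = 180.0
price_486dx = 350.0

# Assumption: 40 MHz Linpack MFLOPS from the benchmark tables serve as the
# performance measure (486DLC, cache on, with Cyrix 387+ versus Intel 486DX).
mflops_486dlc_with_fpu = 0.3962
mflops_486dx = 0.8204

price_ratio = price_486dx / price_486dlc_with_fpu
perf_ratio = mflops_486dx / mflops_486dlc_with_fpu

print(f"price ratio:       {price_ratio:.2f}")
print(f"performance ratio: {perf_ratio:.2f}")
print(f"MFLOPS per US$ (486DLC + FPU): {mflops_486dlc_with_fpu / price_486dlc_with_fpu:.5f}")
print(f"MFLOPS per US$ (486DX):        {mflops_486dx / price_486dx:.5f}")
```

- With these inputs both ratios come out close to two, so the performance per
- dollar is nearly the same for the two kinds of system.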
-
- If you want to push your 386-based system to its maximum floating-point
- performance and can't switch to a 486, I recommend the Intel RapidCAD
- chipset. It is faster [1] than installing a Weitek Abacus 3167 in a
- 386 system, which used to be the highest performing combination before
- the RapidCAD was introduced.
-
- In a similar vein, the introduction of the Intel 486DX2 clock-doubler chips
- has obliterated the need for a Weitek 4167 to get maximum floating-point
- performance out of a 486-based system. A 486DX2-66 performs at or above the
- performance level of a 33 MHz Weitek 4167, even if the latter uses single-
- precision rather than double-precision. The 486DX2-66 is rated by Intel at
- 24700 double-precision kWhetstones/sec and 3.1 double-precision Linpack
- MFLOPS. (Of course, these benchmarks used the highest performance compilers
- available. But even with a Turbo Pascal 6.0 program, I managed to squeeze 1.6
- double-precision MFLOPS out of the 486DX2-66 for the LLL benchmark [for a
- description of these benchmarks, see the paragraph on benchmarks below].)
- With the introduction of the Intel Pentium, floating-point performance for
- 80x86 based machines has clearly climbed to workstation levels. While the
- integer performance of a 66 MHz Pentium system is only about twice that of a
- 486DX2-66, the floating-point performance is 3-4 times as high. I would
- recommend a Pentium based system to everybody with a need for extremely
- high floating-point performance. For people with a need for average to high
- floating-point performance I would recommend a system based on the 486DX2-66s
- from Intel, AMD, or Cyrix. Note that the Cyrix 486DX and 486DX2 CPUs offer
- somewhat higher floating-point performance than their Intel and AMD rivals.
-
-
-
- ============================================================
- The benchmark programs / Coprocessor performance comparisons
- ============================================================
-
- The performance statistics below were put together with the help of four
- widely-known numeric benchmarks and two benchmarks developed by me. Three
- Pascal programs, one FORTRAN program, and two assembly language programs were
- used. The assembly language programs were linked with Borland's Turbo Pascal
- 6.0 for library support, especially to include the coprocessor emulator of
- the TP 6.0 run-time library. The Pascal programs were compiled with Turbo
- Pascal 6.0, a non-optimizing compiler that produces 16-bit code. The FORTRAN
- program was compiled using Microsoft's FORTRAN 5.0, an optimizing compiler
- that generates 16-bit code. All programs use double-precision variables
- (except PEAKFLOP and SAVAGE, which use double extended precision).
-
- Note that the use of a highly optimizing compiler producing 32-bit code can
- give much higher performance for some benchmarks. For example, Intel rates
- the 33 MHz 386/387DX at 3290 kWhetstones/sec and 0.4 double-precision LINPACK
- MFLOPS [28,29], and it rates the Intel 486 at 12300 kWhetstones/sec and 1.6
- double-precision LINPACK MFLOPS [30]. The compilers used in these benchmarks
- run by the chip's manufacturer are the ones that give the highest performance
- available, and sell in the US$ 1000+ price range. Some of them may even be
- experimental or prereleased versions not available to the general public. The
- relative performance of one coprocessor to another can and does vary greatly
- depending on the code generated by compilers. Non-optimizing compilers tend
- to generate a high percentage of operations which access variables in memory,
- while optimizing compilers produce code that contains many operations
- involving registers. Thus it is quite possible that coprocessor A beats
- coprocessor B running benchmark Z if compiled with compiler C, but B beats A
- when the same benchmark is compiled using compiler D.
-
- All benchmarks in this overview were run from floppy under a 'bare-bones'
- MS-DOS 5.0 without the CONFIG.SYS and AUTOEXEC.BAT files. This ensured that
- no TSR or other program unnecessarily stole computing resources from the
- benchmarks.
-
-
- Description of benchmarks
- -------------------------
- PEAKFLOP is the kernel of a fractal computation. It consists mainly of a
- tight loop written in assembly code and fine-tuned to give maximum
- performance. The whole program fits nicely into even a very small CPU cache.
- All variables are held in the CPU's and coprocessor's registers, so the only
- memory access is for opcode fetches. The main loop contains three
- multiplications and five additions/subtractions; this ratio is fairly
- typical for other floating-point intensive programs as well. Due to the
- nature of this program, its MFLOPS rate is unlikely to be exceeded by any
- program that calculates anything useful; thus the name PEAKFLOP. You will
- find the source code for PEAKFLOP in appendix B.
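-
- To illustrate the operation mix described above, here is a minimal Python
- sketch of a fractal inner loop with three multiplications and five
- additions/subtractions per pass. It assumes a Mandelbrot-style iteration;
- the text only says "fractal computation", so the formula, the function names
- and the timing helper are illustrative assumptions and not the assembly
- code from appendix B.

```python
import time

def fractal_kernel(cx, cy, max_iter):
    """Iterate z -> z*z + c and return the number of completed iterations."""
    x = y = 0.0
    for i in range(max_iter):
        x2 = x * x          # multiplication 1
        y2 = y * y          # multiplication 2
        xy = x * y          # multiplication 3
        if x2 + y2 > 4.0:   # addition 1 (escape test)
            return i
        x = x2 - y2 + cx    # one subtraction, one addition (2 and 3)
        y = xy + xy + cy    # two additions (4 and 5)
    return max_iter

FLOPS_PER_ITERATION = 8     # 3 multiplications + 5 additions/subtractions

def measure_mflops(max_iter=200_000):
    # c = 0 never escapes, so exactly max_iter iterations are executed.
    start = time.perf_counter()
    iterations = fractal_kernel(0.0, 0.0, max_iter)
    elapsed = time.perf_counter() - start
    return iterations * FLOPS_PER_ITERATION / elapsed / 1e6

print(f"{measure_mflops():.2f} MFLOPS (interpreter overhead included)")
```

- The figure printed by an interpreted language mostly reflects interpreter
- overhead, so it only demonstrates how the MFLOPS rate is derived from the
- iteration count and the elapsed time.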
-
- TRNSFORM multiplies an array of 8191 vectors with a 3D-transformation matrix
- (a 4x4 matrix). Each vector consists of four double-precision values.
- Multiplying vectors with a matrix is a typical operation in the manipulation
- (e.g. rotation) of 3D objects which are made up from many vectors describing
- the object. This benchmark stresses addition and multiplication as well as
- memory access. For each vector, 16 multiplications and 12 additions are used,
- and about 256 KB of data is accessed during the benchmark run.
-
- For the IIT 3C87, a special version of TRNSFORM was written that makes use of
- the special F4X4 instruction available on that coprocessor. F4X4 does a full
- multiplication of a 4x4 matrix by a 4x1 vector in a single instruction.
- TRNSFORM is implemented as an optimized assembler program linked with the
- Turbo Pascal 6.0 library. The full source code can be found in appendix B.
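-
- The operation and data counts quoted for TRNSFORM follow directly from the
- shape of the computation. The Python sketch below is an illustration only
- (the function name and the example matrix are assumptions, and it is not the
- assembler program from appendix B); it multiplies a 4x4 matrix by a
- 4-element vector and tallies the work for 8191 vectors.

```python
def transform(matrix, vector):
    """Multiply a 4x4 matrix by a 4-element vector, row by row."""
    result = []
    for row in matrix:
        # 4 multiplications and 3 additions per row
        acc = row[0] * vector[0]
        for j in range(1, 4):
            acc += row[j] * vector[j]
        result.append(acc)
    return result

NUM_VECTORS = 8191
MULS_PER_VECTOR = 4 * 4     # 16 multiplications
ADDS_PER_VECTOR = 4 * 3     # 12 additions
BYTES_PER_VECTOR = 4 * 8    # four 8-byte double-precision values

total_flops = NUM_VECTORS * (MULS_PER_VECTOR + ADDS_PER_VECTOR)
data_bytes = NUM_VECTORS * BYTES_PER_VECTOR

# Example: translate the point (1, 2, 3) by (10, 20, 30) using
# homogeneous coordinates.
translate = [[1, 0, 0, 10],
             [0, 1, 0, 20],
             [0, 0, 1, 30],
             [0, 0, 0, 1]]
print(transform(translate, [1.0, 2.0, 3.0, 1.0]))
print(total_flops, "floating-point operations,", data_bytes, "bytes of vectors")
```

- The vector array alone occupies 262,112 bytes, which is just under 256 KB
- and matches the amount of data mentioned above.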
-
- LLL is short for Lawrence Livermore Loops [21], a set of kernels taken from
- real floating-point intensive programs. Some of these loops are vectorizable,
- but since we don't deal with vector processors here, this doesn't matter. For
- this test, LLL was adapted from the FORTRAN original [20] to Turbo Pascal
- 6.0. By variable overlaying (similar to FORTRAN's EQUIVALENCE statement),
- memory allocation for data was reduced to 64 KB, so all data fits into a
- single 64 KB segment. The older version of LLL is used here which contains 14
- loops. There also exists a newer, more elaborate version consisting of 24
- kernels. The kernels in LLL exercise only multiplication and addition. The
- MFLOPS rate reported is the average of the MFLOPS rates of all 14 kernels.
- All floating-point variables in the programs are of type DOUBLE.
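-
- As an illustration of what such a kernel looks like, the sketch below
- transcribes the first of the published Livermore kernels (the "hydro
- fragment") into Python and shows how a single figure can be formed from
- per-kernel rates. It is not taken from the Pascal adaptation used for the
- tests, and reading "average" as the arithmetic mean is an assumption.

```python
def hydro_fragment(n, q, r, t, y, z):
    """Livermore kernel 1: x[k] = q + y[k] * (r*z[k+10] + t*z[k+11])."""
    # Each element costs three multiplications and two additions.
    return [q + y[k] * (r * z[k + 10] + t * z[k + 11]) for k in range(n)]

def average_mflops(rates):
    """Arithmetic mean of the per-kernel MFLOPS rates (assumed averaging)."""
    return sum(rates) / len(rates)

# Tiny example: two elements, all inputs chosen so the result is easy to check.
x = hydro_fragment(2, 1.0, 2.0, 3.0, [1.0, 1.0], [1.0] * 13)
print(x)
print(average_mflops([1.0, 3.0]))
```

- Like the other kernels mentioned above, this one uses only multiplication
- and addition.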
-
- Both LLL and Whetstone results (see below) are reported as returned by my
- COMPTEST test program, in which they have been included as a measure of
- coprocessor/FPU performance. COMPTEST has been compiled under Turbo Pascal
- 6.0 with all 'optimizations' on and using my own run-time library, which
- gives higher performance than the one included with TP 6.0. My library is
- available as TPL60N18.ZIP from garbo.uwasa.fi and ftp sites that mirror this
- site.
-
- Linpack [5] is a well known floating-point benchmark that also heavily
- exercises the memory system. Linpack operates on large matrices and takes up
- about 570 KB in the version used for this test. This is about the largest
- program size a pure DOS system can accommodate. Linpack was originally
- designed to estimate performance of BLAS, a library of FORTRAN subroutines
- that handles various vector and matrix operations. Note that vendors are
- free to supply optimized (e.g., assembly language) versions of BLAS. Linpack
- uses two routines from BLAS which are thought to be typical of the matrix
- operations used by BLAS. Both routines only use addition/subtraction and
- multiplication. The FORTRAN source code for Linpack can be obtained from
- the automated mail server netlib@ornl.gov. Linpack was compiled using MS
- FORTRAN 5.0 in the HUGE memory model (which can handle data structures
- larger than 64 KB) and with compiler switches set for maximum optimization.
- All floating-point variables in the program are of the DOUBLE type. Linpack
- performs the same test repeatedly. The number reported is the maximum MFLOPS
- rate returned by Linpack. Linpack MFLOPS ratings for a great number of
- machines are contained in [6]. This PostScript document is also available
- from netlib@ornl.gov.
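-
- The text does not name the two BLAS routines. The sketch below assumes that
- one of them is DAXPY (y := alpha*x + y), which is the routine usually
- credited with most of the run time in Linpack, and it uses the conventional
- Linpack operation count of 2/3*n^3 + 2*n^2 to turn a run time into MFLOPS.
- The Python function names are illustrative.

```python
def daxpy(alpha, x, y):
    """Return alpha*x + y element-wise: one multiply and one add per element."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def linpack_flops(n):
    """Conventional operation count for factoring and solving an n x n system."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2

def mflops(n, seconds):
    """MFLOPS rate for an n x n Linpack run that took the given time."""
    return linpack_flops(n) / seconds / 1e6

print(daxpy(2.0, [1.0, 2.0], [10.0, 20.0]))
print(f"{linpack_flops(100):.0f} operations for n = 100")
```

- The matrix order n used in the version benchmarked here is not stated in
- the text, so n = 100 above is only an example value.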
-
- Whetstone [2,3,4] is a synthetic benchmark based upon statistics collected
- about the use of certain control and data structures in programs written in
- high level languages. Based on these statistics, it tries to mirror a
- 'typical' HLL program. Whetstone performance is expressed by how many
- hypothetical 'whetstone' instructions are executed per second. It was
- originally implemented in ALGOL. Unlike PEAKFLOP, LLL, and Linpack,
- Whetstone not only uses addition and multiplication but exercises all basic
- arithmetic operations as well as some transcendental functions. Whetstone
- performance depends on the speed of the CPU as well as on the coprocessor,
- while PEAKFLOP, LLL, and Linpack place a heavier burden on the coprocessor/FPU.
-
- There exist both old and new versions of Whetstone. Note that results from
- the two versions can differ by as much as 20% for the same test configuration.
- For this test, the new version in Pascal from [3] was used. It was compiled
- with Turbo Pascal 6.0 and my own library (see above) with all 'optimizations'
- on. All computations are performed using the DOUBLE type.
-
- SAVAGE tests the performance of transcendental function evaluation. It is
- basically a small loop in which the sin, cos, arctan, ln, exp, and sqrt
- functions are combined in a single expression. While sin, cos, arctan, and
- sqrt can be evaluated directly with a single 387 coprocessor instruction
- each, ln and exp need additional preprocessing for argument reduction and
- result conversion. According to [14], the Savage benchmark was devised by
- Bill Savage, and is distributed by: The Wohl Engine Company, Ltd., 8200 Shore
- Front Parkway, Rockaway Beach, NY 11693, USA. Usually, Savage is programmed
- to make 250,000 passes through the loop. Here only 10,000 loops are executed
- for a total of 60,000 transcendental function evaluations. The result is
- expressed in function evaluations per second. SAVAGE source code was taken
- from [7] and compiled with Turbo Pascal 6.0 and my own run-time library
- (see above).
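-
- A minimal Python sketch of such a loop is given below. The exact expression
- is an assumption based on the commonly published form of the Savage
- benchmark, a := tan(arctan(exp(ln(sqrt(a*a))))) + 1, with the tangent
- written as sin/cos so that the six functions named above are each called
- once per pass.

```python
import math

def savage(loops):
    """Run the (assumed) Savage expression the given number of times."""
    a = 1.0
    for _ in range(loops):
        angle = math.atan(math.exp(math.log(math.sqrt(a * a))))
        # tan(angle) written as sin/cos: six function calls per pass in total
        a = math.sin(angle) / math.cos(angle) + 1.0
    return a

LOOPS = 10_000
FUNCTIONS_PER_LOOP = 6

result = savage(LOOPS)
print(result)                        # mathematically this would be LOOPS + 1
print(LOOPS * FUNCTIONS_PER_LOOP)    # total transcendental function evaluations
```

- In exact arithmetic every pass simply adds one to the variable, so the
- deviation of the final value from LOOPS + 1 also gives a rough indication of
- the accumulated rounding error.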
-
-
-
- Benchmark results using the Intel 386DX CPU and various coprocessors
- --------------------------------------------------------------------
-
- My benchmark results for 387 coprocessors, coprocessor emulators and the
- Intel RapidCAD and Intel 486 CPUs, using the programs described above, on
- an Intel 386DX system:
-
-
- 33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
-
- Intel 386DX WITH:
- EM87 emulator 0.0070 0.0040 0.0050 0.0050 26 418 ##
- Franke387 emu. 0.0307 0.0246 0.0194 0.0179 137 3335 @@
- TP/MS-FORT emu 0.0263 0.0227 0.0167 0.0158 133 3160 %%
- Q387 emulator 0.0768 0.0583 0.0285 0.0288 251 7538 ((
- Intel 387DX 0.7647 0.6004 0.3283 0.2676 2046 43860
- ULSI 83C87 1.0097 0.6609 0.3239 0.2598 2089 47431
- IIT 3C87 0.8455 0.5957 0.3198 0.2646 2203 49020
- IIT 3C87,4X4 0.8455 1.4334 0.3198 0.2646 2203 49020 $$
- ULSI DX/DLC 1.0097 0.6628 0.3228 0.2496 2144 51107
- C&T 38700 0.9455 0.6907 0.3338 0.2700 2376 62565
- Cyrix 387+ 0.9286 0.6806 0.3293 0.2669 2435 66890
- Cyrix EMC87 1.0400 0.6628 0.3352 0.2808 2540 71685 //
-
- Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
- Intel 486DX 2.0800 1.7779 0.9387 0.6682 5143 82192
-
-
-
- 40 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
-
- Intel 386DX WITH:
- EM87 emulator 0.0084 0.0080 0.0060 0.0060 31 502 ##
- Franke387 emu. 0.0369 0.0295 0.0233 0.0215 164 4002 @@
- TP/MS-FORT emu 0.0316 0.0273 0.0200 0.0190 160 3794 %%
- Q387 emulator 0.0922 0.0700 0.0342 0.0345 302 9053 ((
- Intel 387DX 0.9204 0.7212 0.3932 0.3211 2428 52677
- ULSI 83C87 1.2093 0.7936 0.3890 0.3120 2528 56926
- IIT 3C87 1.0196 0.7145 0.3834 0.3179 2663 58766
- IIT 3C87,4X4 1.0196 1.7244 0.3834 0.3179 2663 58766 $$
- ULSI DX/DLC 1.2093 0.7935 0.3880 0.3000 2586 61287
- C&T 38700 1.0722 0.7908 0.4007 0.3222 2837 74906
- Cyrix 387+ 1.1305 0.8162 0.3945 0.3208 2946 80322
- Cyrix EMC87 1.2381 0.7963 0.4025 0.3324 3061 86083 //
-
- Intel RapidCAD 2.2128 1.8931 0.7377 0.5432 4810 86957
- Intel 486DX 2.4762 2.1335 1.1110 0.8204 6195 98522
-
-
-
- Benchmark results using the Cyrix 486DLC CPU and various coprocessors
- ---------------------------------------------------------------------
-
- The Cyrix 486DLC is the latest entry into the market of 386DX replacement
- processors. It features an Intel 486SX-compatible instruction set, a 1 KB on-
- chip cache, and a 16x16 bit hardware multiplier. The RISC-like execution unit
- of the 486DLC executes many instructions in a single clock cycle. The
- hardware multiplier multiplies 16-bit quantities in 3 clock cycles, as
- compared to 12-25 cycles on a standard Intel 386DX. This is especially useful
- in address calculations (code from non-optimizing compilers may contain many
- MUL instructions for array accesses) and for software floating-point
- arithmetic. The 1 KB cache helps the 486DLC to overcome some of the
- limitations of the 386 bus interface, and although its hit rate averages only
- about 65% under normal program conditions, a 5-15% overall performance
- increase can usually be seen for both integer and floating-point-intensive
- applications when it is enabled.
-
- The 486DLC's internal cache is a unified data/instruction write-through type,
- and can be configured as either a direct mapped or a 2-way set associative
- cache. For compatibility reasons, the cache is disabled after a processor
- reset and must be enabled with the help of a small routine provided by
- Cyrix. Cyrix has also defined some additional cache control signals for some
- of the 486DLC pins, intended to improve communication between the on-chip
- cache and an external cache. Current 386 systems ignore these signals, since
- they are not defined for the standard Intel 386DX. However, future systems
- designed with the 486DLC in mind may take advantage of them for increased
- performance.
-
- In existing 386 systems, DMA transfers (e.g., by a SCSI controller or a
- soundcard) may cause the 486DLC's entire on-chip cache to be flushed, since
- no other means exist to enforce consistency between the cache contents and
- main memory. This reduces the performance of the 486DLC in these cases. The
- 486DLC on-chip cache does, however, allow specification of up to four non-
- cacheable regions, which is particularly useful if your system has memory
- mapped peripherals (e.g., a Weitek coprocessor).
-
- Although I successfully ran my test programs on the Cyrix chip with all
- coprocessors, not all of them worked well with my 486DLC in all circumstances.
- The IIT 3C87, the Cyrix 83D87 (chips manufactured prior to November 1991),
- and the Cyrix EMC87 should not be used with the 486DLC, since they may cause
- the computer to lock up if the FSAVE and FRSTOR instructions are used. (These
- instructions are typically used in protected mode multiple task environments
- to save and restore the coprocessor state for each task. Note that Microsoft
- Windows also fits this description.) According to Cyrix, this problem occurs
- only with first revision 486DLCs (sample chips such as mine) and has been
- fixed since about mid 1993. To be on the safe side, I recommend using the
- Cyrix 387+ with the 486DLC, both for assured compatibility and for best
- performance. Note that 387+ is a 'Europe only' name and that this chip is
- called 83D87 elsewhere, just like the old version. You need to get an 83D87
- produced after about October 1991 to guarantee that it works correctly with
- any 486DLC; the same caveat applies to the Cyrix 486SLC and the Cyrix 83S87.
- If you already have a Cyrix coprocessor, use my COMPTEST program to find out
- whether you have a 'new' or 'old' coprocessor. COMPTEST is available as
- CTEST260.ZIP via anonymous ftp from garbo.uwasa.fi (in the pc/sysinfo
- directory) and other ftp servers that mirror garbo.
-
- The Cyrix 486DLC is currently the 386 'clone' with the highest integer
- performance. With the internal cache enabled, integer performance of the
- 486DLC can be up to 80% higher than that of an Intel 386DX at the same clock
- frequency, with the average speed gain for most applications being about 35%.
- Floating-point applications are typically accelerated by about 15%-30% when
- using a Cyrix 486DLC (with its cache enabled) instead of the Intel 386DX.
-
-
- 33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
- Cyrix 486DLC
- (cache off) WITH:
- EM87 emulator 0.0089 0.0082 0.0062 0.0063 31 472 ##
- Franke387 emu. 0.0402 0.0324 0.0258 0.0240 184 4807 @@
- TP/MS-FORT emu 0.0346 0.0288 0.0206 0.0212 173 4401 %%
- Q387 emulator 0.1053 0.0718 0.0356 0.0370 313 9894 ((
- Intel 387DX 0.8455 0.6552 0.3659 0.3033 2249 48780
- ULSI 83C87 1.1818 0.7543 0.3752 0.3026 2381 53476
- IIT 3C87 0.9541 0.6609 0.3653 0.3036 2476 55814
- IIT 3C87,4X4 0.9541 1.4988 0.3653 0.3036 2476 55814 $$
- ULSI DX/DLC 1.1818 0.7543 0.3752 0.2955 2467 58027
- C&T 38700 1.1183 0.7644 0.3796 0.3087 2703 73350
- Cyrix 387+ 1.1305 0.7445 0.3727 0.3060 2731 81967
- Cyrix EMC87 1.2236 0.7593 0.3823 0.3144 2908 88889 //
-
- Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
- Intel 486DX 2.0800 1.7779 0.9387 0.6682 5143 82192
-
-
-
- 40.0 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
- Cyrix 486DLC
- (cache off) WITH:
- EM87 emulator 0.0107 0.0098 0.0075 0.0075 37 567 ##
- Franke387 emu. 0.0488 0.0392 0.0311 0.0288 223 5808 @@
- TP/MS-FORT emu 0.0416 0.0345 0.0246 0.0253 208 5284 %%
- Q387 emulator 0.1265 0.0862 0.0429 0.0444 375 11886 ((
- Intel 387DX 1.0196 0.7880 0.4375 0.3644 2712 58479
- ULSI 83C87 1.4247 0.9064 0.4506 0.3630 2868 64171
- IIT 3C87 1.1556 0.7963 0.4399 0.3611 2988 66964
- IIT 3C87,4X4 1.1556 1.7916 0.4399 0.3611 2988 66964 $$
- ULSI DX/DLC 1.4243 0.9064 0.4510 0.3544 2976 69606
- C&T 38700 1.3333 0.9210 0.4548 0.3708 3254 88106
- Cyrix 387+ 1.3507 0.8958 0.4477 0.3754 3297 98361
- Cyrix EMC87 1.4648 0.9136 0.4548 0.3773 3505 106572 //
-
- Intel RapidCAD 2.2128 1.8931 0.7377 0.5432 4810 86957
- Intel 486DX 2.4762 2.1335 1.1110 0.8204 6195 98522
-
-
-
- 33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
- Cyrix 486DLC
- (cache on) WITH:
- EM87 emulator 0.0099 0.0089 0.0068 0.0069 35 550 ##
- Franke387 emu. 0.0462 0.0362 0.0288 0.0265 205 5445 @@
- TP/MS-FORT emu 0.0410 0.0330 0.0234 0.0241 198 5339 %%
- Q387 emulator 0.1137 0.0784 0.0385 0.0398 336 10455 ((
- Intel 387DX 0.8525 0.6552 0.3941 0.3279 2332 49834
- ULSI 83C87 1.2093 0.7543 0.4068 0.3270 2478 57197
- IIT 3C87 0.9720 0.6609 0.3959 0.3295 2579 57252
- IIT 3C87,4X4 0.9720 1.5087 0.3959 0.3295 2579 57252 $$
- ULSI DX/DLC 1.1954 0.7543 0.4073 0.3214 2564 59583
- C&T 38700 1.1305 0.7644 0.4126 0.3343 2839 75949
- Cyrix 387+ 1.1429 0.7445 0.4023 0.3310 2866 85349
- Cyrix EMC87 1.2381 0.7593 0.4150 0.3412 3051 93897 //
-
- Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
- Intel 486DX 2.0800 1.7779 0.9387 0.6682 5143 82192
-
-
-
- 40.0 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
- Cyrix 486DLC
- (cache on) WITH:
- EM87 emulator 0.0118 0.0107 0.0082 0.0082 42 659 ##
- Franke387 emu. 0.0565 0.0438 0.0350 0.0313 248 6585 @@
- TP/MS-FORT emu 0.0491 0.0395 0.0279 0.0296 238 6408 %%
- Q387 emulator 0.1365 0.0942 0.0463 0.0477 403 12555 ((
- Intel 387DX 1.0297 0.7880 0.4748 0.3937 2801 59821
- ULSI 83C87 1.4445 0.9028 0.4891 0.3926 2976 65789
- IIT 3C87 1.1686 0.7963 0.4734 0.3916 3096 68729
- IIT 3C87,4X4 1.1686 1.8057 0.4734 0.3916 3096 68729 $$
- ULSI DX/DLC 1.4445 0.9064 0.4893 0.3864 3069 71514
- C&T 38700 1.3685 0.9173 0.4958 0.4012 3401 91185
- Cyrix 387+ 1.3867 0.8958 0.4887 0.3962 3448 102564
- Cyrix EMC87 1.4857 0.9100 0.4959 0.4091 3676 112360 //
-
- Intel RapidCAD 2.2128 1.8931 0.7377 0.5432 4810 86957
- Intel 486DX 2.4762 2.1335 1.1110 0.8204 6195 98522
-
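- The floating-point gain of the 486DLC over the 386DX can be read off the
- tables by dividing corresponding rows. The Python sketch below does this for
- the Cyrix 387+ at 33.3 MHz, using the 386DX row from the earlier table and
- the 486DLC (cache on) row above. The choice of this one coprocessor as the
- example is arbitrary.

```python
benchmarks = ["PEAKFLOP", "TRNSFORM", "LLL", "Linpack", "Whetstone", "Savage"]

# Cyrix 387+ rows at 33.3 MHz, copied from the tables in this document.
with_386dx = [0.9286, 0.6806, 0.3293, 0.2669, 2435, 66890]
with_486dlc_cache_on = [1.1429, 0.7445, 0.4023, 0.3310, 2866, 85349]

# Percentage gain of the 486DLC system over the 386DX system per benchmark.
gains = {
    name: (new / old - 1.0) * 100.0
    for name, old, new in zip(benchmarks, with_386dx, with_486dlc_cache_on)
}

for name, gain in gains.items():
    print(f"{name:10s} {gain:5.1f}%")
```

- For this coprocessor the gains range from roughly 9% (TRNSFORM) to roughly
- 28% (Savage), which is broadly in line with the 15%-30% quoted in the text.
-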
-
-
-
- Benchmark results using the C&T 38600DX CPU and various coprocessors
- --------------------------------------------------------------------
-
- The Chips&Technologies 38600DX CPU is marketed as a 100% compatible
- replacement for the Intel 386DX CPU. Unlike AMD's Am386, which uses microcode
- that is identical to the Intel 386DX's, the C&T 38600DX uses microcode
- developed independently by C&T using "clean-room" techniques. C&T even
- included the 386DX's "undocumented" LOADALL386 instruction in the
- instruction set to provide full compatibility with the 386DX. In my tests,
- however, I observed that the 38600DX has severe problems with the CPU-
- coprocessor communication, which causes the floating-point performance to
- drop below that of the Intel 386DX/Intel 387DX for most programs. This
- problem exists with all available 387-compatible coprocessors (ULSI 83C87,
- IIT 3C87, Cyrix EMC87, Cyrix 83D87, Cyrix 387+, C&T 38700, Intel 387DX). A
- net.acquaintance also did tests with the 38600DX and arrived at similar
- results. He contacted C&T and they said that they were aware of the problem.
-
- Some instructions execute faster on the C&T 38600DX than on the 386DX, giving
- an average speedup of 5-10% for integer applications. C&T also produces a
- 38605DX CPU that includes a 512 byte instruction cache and provides a further
- performance increase. However, the 38605DX needs a bigger socket (144-pin
- PGA) and is therefore *not* pin-compatible with the 386DX. Tests using the
- 38600DX were run at 33.3 MHz, as a 40 MHz version was not available as of
- 09-17-92 and running the 33 MHz chip version at 40 MHz locked up the machine
- frequently. Unfortunately, tests using the Intel 387DX consistently locked up
- in the TRNSFORM benchmark when run at 33.3 MHz. It ran fine at 20 MHz, and
- the results were scaled to show expected performance at 33.3 MHz.
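-
- The text does not spell out how the 20 MHz results were scaled. The sketch
- below assumes simple linear scaling with clock frequency, which is the usual
- approach when CPU, coprocessor and memory all run in step with the same
- clock. The function name is illustrative. Working backwards from the
- TRNSFORM value listed in the table below gives the 20 MHz measurement that
- this assumption implies.

```python
def scale_to_clock(result, measured_mhz, target_mhz):
    """Scale a benchmark result linearly with clock frequency (assumption)."""
    return result * target_mhz / measured_mhz

# TRNSFORM value listed for the C&T 38600DX / Intel 387DX pair at 33.3 MHz.
reported_at_33 = 0.5620   # MFLOPS, already scaled

# 20 MHz measurement implied by the linear-scaling assumption.
implied_at_20 = scale_to_clock(reported_at_33, 33.3, 20.0)

print(f"implied 20 MHz result: {implied_at_20:.4f} MFLOPS")
print(f"scaled back to 33.3 MHz: {scale_to_clock(implied_at_20, 20.0, 33.3):.4f} MFLOPS")
```

- Scaled figures of this kind ignore any wait states that appear only at the
- higher clock rate, so they should be read as estimates.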
-
-
- 33.3 MHz PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
-
- C&T 38600DX WITH:
- Intel 387DX 0.7376 0.5620 0.3337 0.2636 2066 45489
- ULSI 83C87 0.5226 0.4690 0.3236 0.2654 2087 43228
- IIT 3C87 0.7879 0.5762 0.3397 0.2674 2263 51195
- IIT 3C87,4X4 0.7879 0.6181 0.3397 0.2674 2263 51195 $$
- C&T 38700 0.5977 0.5572 0.3463 0.2681 2338 63966
- Cyrix 387+ 0.5896 0.5508 0.3438 0.2673 2375 66741
-
- Intel RapidCAD 1.8572 1.5798 0.6072 0.4533 3953 72464
- Intel 486 2.0800 1.7779 0.9387 0.6682 5143 82192
-
-
- For comparison:
-
- PEAKFLOP TRNSFORM LLL Linpack Whetstone Savage
- MFLOPS MFLOPS MFLOPS MFLOPS kWhet/sec Func/sec
-
- Pentium-66 11.5557 8.1900 2.5971 2.2150 44000 337079 ))
- i486DX2-66 4.1601 3.4227 1.6531 1.3010 10655 163934
- i486DX2-50 3.0589 2.6665 1.2537 0.9744 7962 123203
- Cyrix EMC87-25 0.7800 0.4971 0.2514 0.2106 1905 53763 [[
- IIT 3C87-25 0.6341 1.0751 0.2399 0.1985 1652 36765 $$,[[
- IIT 3C87-25 0.6341 0.4467 0.2399 0.1985 1652 36765 [[
- i387, 20 MHz 0.2253 0.3271 0.1434 0.1171 952 21739 ++
- i387DX, 20 MHz 0.3567 0.4444 0.1484 0.1161 1034 24155 &&
- i80287, 5 MHz 0.0281 0.0310 0.0242 0.0222 150 3261 !!
- i8087,9.54 MHz 0.0636 0.0705 0.0321 0.0219 234 5782 **
-
-
-
- Benchmark notes and footnotes
- -----------------------------
-
- Hardware configuration for test of 387 coprocessors with C&T 38600DX, Intel
- 386DX, Cyrix 486DLC, and Intel RapidCAD CPUs:
-
- System A: Motherboard with Forex chip set, 128 KB CPU Cache, 8 MB RAM
-
-
- Hardware configuration for test of 486DX FPU (extra fan for 40 MHz operation):
-
- System B: Motherboard with SIS chip set, 256 KB CPU Cache, 8 MB RAM
-
-
- ## EM87 V1.2 by Ron Kimball is a public domain coprocessor emulator that
- loads as a TSR. It hooks the INT 7 trap that 80286, 80386, and 486SX
- systems without a coprocessor raise upon encountering a coprocessor
- instruction, and emulates the trapped instruction. The Whetstone and
- Savage benchmarks for this test were compiled with the original TP 6.0
- library, as EM87 chokes on the 387-specific FSIN and FCOS instructions
- that my own library uses if a 387 is detected. Evidently EM87 identifies
- itself as a 387, but it has no support for 387-specific instructions.
-
- @@ Franke387 is a commercial 387 emulator that is also available in a
- shareware version. For this test, shareware version V2.4 was used.
- Franke387 unlike many other emulators supports all 387 instructions.
- It is loaded as a device driver and uses INT 7 to trap coprocessor
- instructions.
-
- (( Q387 is an emulator that is distributed as a shareware program by
- Quickware of Austin, Texas. As the name implies, this emulator uses
- 386 specific code and supports the full 387 instruction set. The
- program is about 360 kByte in size and loads completely into extended
- memory, using absolutely no DOS memory. It is loaded as a TSR and
- requires an EMM (expanded memory manager) to be present. For the
- tests done for this version of this article, QEMM 7.04 was used. The
- emulation uses the INT 7 mechanism. The version of Q387 used for this
- version of this report was 3.63. Q387 seems to be the only coprocessor
- emulator that is still continuously being updated.
-
- %% These benchmarks were run using the built-in coprocessor emulators of
- the TP 6.0 (for Savage, LLL, Whetstone, TRNSFORM, PEAKFLOP) and the MS
- FORTRAN 5.0 (for Linpack) run-time libraries. The libraries were forced
- not to use a coprocessor by means of the environment settings NO87=NC
- and 87=N.
-
- $$ The 3C87 specific F4X4 instruction was used in the vector transformation
- benchmark.
-
- // The EMC87 was used in the 387-compatible mode only. The faster memory-
- mapped mode was *not* used. Times should therefore be identical to the
- Cyrix 83D87.
-
- ++ Older motherboard with no chip set (discrete logic), no CPU cache, 16 MB
- RAM
-
- && System A, CPU cache disabled via extended set-up, turbo-switch set to
- half speed (that is, 20 MHz)
-
- [[ System A, CPU Intel 386DX, oscillator changed to run the system at 25 MHz.
-
- )) System based on an Intel motherboard with 256 KByte of CPU cache and 16
- MB of RAM with an AMI BIOS.
-
- !! 80386 @ 20 MHz / Intel 80287 @ 5 MHz, no CPU cache, 4 MB RAM. Due to the
- fast CPU used here, performance figures are somewhat higher than can be
- expected for an 80286/287 combination, except for the PEAKFLOP benchmark,
- which is basically coprocessor limited.
-
- ** 8086/8087 system with 640 KB RAM
-
-
- Benchmark results for Weitek coprocessors
- ------------------------------------------
- Since neither a Weitek coprocessor nor a compiler that generates code for the
- Weitek chips was available to me, performance data for the Weitek Abacus is
- given here according to [31,32] and scaled to show performance of a 33 MHz
- system. The benchmarks were compiled using highly-optimizing 32-bit
- compilers.
-
- Single Prec. Double Prec. Double Prec.
-
- 3167 4167 3167 4167 387 486
-
- Linpack MFLOPS 1.8 5.0 0.8 3.2 0.4 1.6
- Whetstone kWhet/sec 7470 22700 4900 14000 3290 12300
-
- Note that for the Intel coprocessors, running programs in single vs. double-
- precision doesn't provide much of a performance advantage since all internal
- calculations are always done in extended precision. Using Weitek
- coprocessors, however, performance nearly doubles in single-precision mode.
- For double-precision calculations using only basic arithmetic, the Weitek
- Abacus can at most provide performance at twice the level of the respective
- Intel coprocessor (387/486) at the same clock speed.
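-
- The ratios behind these statements can be recomputed from the Linpack row
- of the table above. The short sketch below is only an illustration; the
- dictionary layout and function names are my own, and the pairing of the
- 3167 with the 387 and of the 4167 with the 486 follows the table columns.
-
```python
# Linpack MFLOPS figures copied from the table above (33 MHz systems).
WEITEK = {
    "3167": {"single": 1.8, "double": 0.8},
    "4167": {"single": 5.0, "double": 3.2},
}
# Double-precision figures of the Intel part paired with each Weitek chip.
INTEL_DOUBLE = {"3167": 0.4, "4167": 1.6}   # 387 and 486 respectively


def single_vs_double(chip):
    """Speedup of single over double precision on one Weitek chip."""
    return WEITEK[chip]["single"] / WEITEK[chip]["double"]


def weitek_vs_intel(chip):
    """Double-precision speedup of the Weitek chip over its Intel partner."""
    return WEITEK[chip]["double"] / INTEL_DOUBLE[chip]


for chip in WEITEK:
    print(f"{chip}: single/double = {single_vs_double(chip):.2f}, "
          f"Weitek/Intel = {weitek_vs_intel(chip):.2f}")
```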
-
-
- Comparison of floating-point performance [30,32]
-
- single-precision
-
- Weitek 4167-33 Intel 486-33 Intel 486DX2-66
-
- Linpack MFLOPS 5.0 1.8 3.5
- Whetstone kWhet/sec 22700 12700 25500
-
-
- double-precision
-
- Weitek 4167-33 Intel 486-33 Intel 486DX2-66
-
- Linpack MFLOPS 3.5 1.6 3.1
- Whetstone kWhet/sec 14000 12300 24700
-
-
-
- =============================================================================
- Clock-cycle timings for coprocessor instructions on various coprocessor chips
- =============================================================================
-
- The following tables show the speed of various coprocessor instructions,
- measured in clock cycles, as captured by my program 387TIMES. The error is
- +/- one clock cycle, except for the Intel 80287. Times for the 80287 were
- determined on a system with a 20 MHz 80386 and a 5 MHz Intel 80287, so they
- may differ from those of a genuine 80286/287 system, especially for
- instructions that access an operand in memory. The times are stated in
- coprocessor clock cycles, and the 386 executes four clock cycles in the time
- the 80287 executes one, so memory accesses may take fewer coprocessor clock
- cycles than they would on a genuine 80286/287 system.
-
- Due to the limited accuracy of the timer used to measure the speed of FPU
- instructions combined with the high clock frequency and the high execution
- speed for simple FPU instructions on the Pentium, the times for fast
- instructions could not be measured reliably. The data given represents the
- average from several runs of 387TIMES. The times for slower instructions
- are accurate to about +/- one clock cycle as stated above.
-
- The CPU used in testing the 387 coprocessors was an Intel 386DX. Note that
- due to the improved coprocessor interface of the Cyrix 486DLC the execution
- time of most coprocessor instructions drops by 2-3 clock cycles when used
- with this CPU.
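-
- Cycle counts become execution times once a clock frequency is chosen. The
- sketch below converts two FSIN (1.0) entries from the table that follows
- into microseconds. The 33.3 MHz clock is an assumption made for the
- example, and the helper name is hypothetical.
-
```python
def cycles_to_microseconds(cycles, clock_mhz):
    """Time in microseconds for an instruction that takes `cycles` clocks."""
    if clock_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    # One cycle lasts 1/clock_mhz microseconds.
    return cycles / clock_mhz


# FSIN (1.0) cycle counts from the table; 33.3 MHz is an assumed clock.
fsin_cycles = {"Intel 387DX": 509, "Cyrix 83D87": 114}
for chip, cycles in fsin_cycles.items():
    print(f"{chip}: {cycles_to_microseconds(cycles, 33.3):.1f} us")
```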
-
-
- Intel Intel Intel Cyrix Cyrix C&T ULSI ULSI IIT Intel Intel
- Pentium i486 RapidCAD 83D87 387+ 38700 DX/DLC 83C87 3C87 387DX 80387
-
- FLD1 1 4 3 14 14 14 19 18 24 23 26
- FLDZ 1 4 3 14 14 14 19 18 24 23 31
- FLDPI 3 7 8 14 15 14 19 18 24 38 45
- FLDLG2 3 7 8 14 14 14 19 18 24 33 45
- FLDL2T 3 7 8 14 14 14 19 19 24 38 45
- FLDL2E 3 7 8 14 14 14 19 19 24 38 45
- FLDLN2 3 7 8 14 14 14 19 19 24 38 45
- FLD ST(0) 1 4 4 14 14 14 14 14 24 20 21
- FST ST(1) 2 3 4 14 14 14 14 14 19 18 22
- FSTP ST(0) 2 4 4 14 14 14 15 15 19 19 22
- FSTP ST(1) 3 4 4 15 15 14 15 15 19 20 22
- FLD ST(1) 1 4 4 14 14 14 14 14 24 18 21
- FXCH ST(1) 1 4 4 14 20 14 19 19 24 24 27
- FILD [Word] 1 12 16 33 37 32 42 42 38 47 62
- FILD [DWord] 1 8 11 26 26 21 33 32 28 35 45
- FILD [QWord] 1 9 15 30 30 25 37 36 32 34 54
- FLD [DWord] 1 3 5 26 26 21 23 23 28 20 25
- FLD [QWord] 1 3 7 30 30 25 27 27 32 24 35
- FLD [TByte] 3 5 11 46 46 46 46 46 47 46 57
- FBLD [TByte] 46 83 90 66 86 106 146 146 197 71 278
- FIST [Word] 5 31 31 37 40 37 42 42 51 69 90
- FIST [DWord] 5 29 30 35 40 35 40 40 49 66 84
- FST [DWord] 1 7 7 35 37 32 40 40 33 37 40
- FST [QWord] 1 8 9 43 43 39 47 47 40 45 51
- FISTP [Word] 8 32 32 42 40 37 43 43 46 70 90
- FISTP [DWord] 9 31 31 40 40 35 41 41 50 67 87
- FISTP [QWord] 8 29 29 44 44 42 48 48 56 73 92
- FSTP [DWord] 5 8 8 38 36 32 41 41 35 38 43
- FSTP [QWord] 5 9 9 46 43 39 48 48 42 46 49
- FSTP [TByte] 6 8 8 50 45 49 50 50 48 53 58
- FBSTP [TByte] 155 170 172 98 98 114 132 129 218 144 533
- FINIT 21 17 31 15 16 15 15 15 16 16 25
- FCLEX 8 7 20 15 16 16 15 16 16 16 25
- FCHS 1 7 8 14 15 14 14 14 19 30 33
- FABS 1 5 5 14 15 14 14 14 19 30 33
- FXAM 16 12 13 14 15 14 14 14 19 39 43
- FTST 1 5 5 19 25 14 24 24 24 34 38
- FSTENV 60 67 82 125 125 124 132 132 124 159 165
- FLDENV 37 44 59 106 106 112 117 120 106 119 129
- FSAVE 154 181 169 355 355 374 362 361 376 469 511
- FRSTOR 74 130 203 358 358 385 369 372 371 420 456
- FSTSW [mem] 4 4 5 14 14 14 14 14 14 14 17
- FSTSW AX 4 3 4 12 12 11 11 11 11 11 14
- FSTCW [mem] 2 4 5 14 14 13 13 13 13 14 18
- FLDCW [mem] 7 4 11 26 26 31 25 32 27 32 36
- FADD ST,ST(0) 2 8 9 19 20 19 19 19 24 24 32
- FADD ST,ST(1) 3 9 9 19 20 19 18 18 24 20 32
- FADD ST(1),ST 3 10 10 19 20 19 18 18 24 24 37
- FADDP ST(1),ST 3 11 11 19 19 19 15 16 24 25 37
- FADD [DWord] 1 9 10 25 28 22 23 23 23 21 34
- FADD [QWord] 1 9 10 32 32 26 27 27 27 25 38
- FIADD [Word] 3 20 21 34 34 33 39 40 40 52 80
- FIADD [DWord] 3 20 21 27 28 27 29 30 30 37 61
- FSUB ST(1),ST 3 10 10 19 20 19 19 19 24 24 38
- FSUBR ST(1),ST 2 9 10 19 22 19 19 19 24 27 38
- FSUBRP ST(1),ST 2 10 10 19 19 22 19 20 24 25 38
- FSUB [DWord] 2 11 12 27 28 27 23 23 29 27 32
- FSUB [QWord] 2 11 12 32 32 31 27 27 33 26 44
- FISUB [Word] 3 21 21 34 34 34 39 40 40 52 80
- FISUB [DWord] 3 21 22 27 28 27 29 29 30 40 60
- FMUL ST,ST(1) 2 16 17 19 25 24 24 24 29 38 57
- FMUL ST(1),ST 3 16 17 19 24 24 24 24 29 40 62
- FMULP ST(1),ST 2 17 17 19 24 24 25 25 29 40 58
- FIMUL [Word] 3 22 23 40 40 37 45 46 46 52 80
- FIMUL [DWord] 3 22 23 27 28 27 35 36 35 45 68
- FMUL [DWord] 2 11 12 27 28 27 28 28 29 25 45
- FMUL [QWord] 2 14 15 32 32 31 32 32 33 37 61
- FDIV ST,ST(0) 38 73 74 26 40 59 54 54 54 89 100
- FDIV ST,ST(1) 38 73 74 36 45 59 54 54 54 77 100
- FDIV ST(1),ST 38 73 74 36 45 59 54 55 54 78 102
- FDIVR ST(1),ST 38 73 74 36 45 59 54 54 54 77 102
- FDIVRP ST(1),ST 38 73 74 36 44 59 55 55 54 76 106
- FIDIV [Word] 39 84 85 52 58 75 75 76 76 105 141
- FIDIV [DWord] 39 84 85 45 46 65 65 65 65 101 123
- FDIV [DWord] 38 73 74 45 46 63 56 56 59 77 101
- FDIV [QWord] 38 73 74 50 50 67 60 60 63 78 103
- FSQRT (0.0) 4 25 25 19 19 14 19 19 24 29 37
- FSQRT (1.0) 70 83 84 36 74 54 89 89 59 109 132
- FSQRT (L2T) 70 86 87 36 74 54 89 89 59 104 137
- FXTRACT (L2T) 12 17 17 19 19 19 28 28 79 53 72
- FSCALE (PI,5) 31 30 30 36 24 24 49 49 79 59 82
- FRNDINT (PI) 19 31 31 19 29 24 34 34 29 49 82
- FPREM (99,PI) 27 58 59 54 99 44 54 54 49 79 96
- FPREM1(99,PI) 41 90 91 54 99 44 59 59 54 104 121
- FCOM 1 5 6 15 20 19 24 25 19 29 32
- FCOMP 2 6 6 15 19 19 25 25 19 30 33
- FCOMPP 2 7 7 15 19 19 25 25 19 31 40
- FICOM [Word] 3 16 17 34 34 33 45 46 34 58 76
- FICOM [DWord] 3 16 16 21 28 21 35 35 23 45 57
- FCOM [DWord] 1 5 6 21 28 22 23 23 23 27 34
- FCOM [QWord] 1 5 8 27 32 25 27 27 27 31 39
- FSIN (0.0) 17 24 24 14 99 14 19 19 24 39 43
- FSIN (1.0) 98 310 313 114 164 144 319 494 219 509 596
- FSIN (PI) 73 88 89 118 189 64 124 64 214 134 152
- FSIN (LG2) 82 292 295 72 89 139 284 454 184 449 531
- FSIN (L2T) 72 299 302 123 179 164 304 469 214 454 536
- FCOS (0.0) 18 24 24 19 159 14 19 19 24 34 42
- FCOS (1.0) 96 302 305 84 104 139 319 489 214 459 547
- FCOS (PI) 72 88 89 154 254 64 119 64 224 199 232
- FCOS (LG2) 83 300 303 108 149 139 279 454 194 504 583
- FCOS (L2T) 72 307 310 159 239 164 299 469 224 509 601
- FSINCOS (0.0) 15 25 25 14 19 19 18 18 34 38 55
- FSINCOS (1.0) 107 353 356 124 174 254 324 493 419 538 636
- FSINCOS (PI) 93 105 106 162 263 79 124 68 424 228 277
- FSINCOS (LG2) 91 340 343 119 159 249 283 458 359 533 627
- FSINCOS (L2T) 93 347 350 168 248 274 303 473 424 538 646
- FPTAN (0.0) 15 25 25 14 19 19 18 18 29 38 46
- FPTAN (1.0) 142 266 269 119 149 184 363 538 309 323 396
- FPTAN (PI) 125 145 146 134 228 104 169 108 304 168 211
- FPTAN (LG2) 126 244 246 94 129 179 328 498 274 298 363
- FPTAN (L2T) 125 247 249 139 219 204 348 513 304 298 365
- FPATAN (0.0) 25 38 39 19 24 19 19 20 29 95 93
- FPATAN (1.0) 93 294 298 124 159 29 374 375 604 360 433
- FPATAN (PI) 132 304 308 139 188 279 359 360 424 375 472
- FPATAN (LG2) 129 290 293 128 154 269 364 365 379 375 448
- FPATAN (L2T) 135 304 308 144 189 274 359 359 424 375 468
- F2XM1 (0.0) 14 25 25 14 14 14 19 19 24 34 37
- F2XM1 (LN2) 52 209 211 89 119 169 389 394 284 299 348
- F2XM1 (LG2) 52 204 206 78 104 159 374 379 284 294 337
- FYL2X (1.0) 39 60 61 36 39 24 74 75 94 115 127
- FYL2X (PI) 104 294 297 108 163 249 449 450 359 395 504
- FYL2X (LG2) 104 311 314 108 159 249 459 460 339 410 518
- FYL2X (L2T) 104 293 296 108 164 249 434 439 359 390 501
- FYL2XP1 (LG2) 103 334 337 99 169 234 459 460 284 435 538
-
-
-
- 486DLC + 386DX + 386DX + 386DX + 386DX +
- Intel Intel Q387 Q387 Franke387 TP 6.0 EM87
- 8087 80287 Emulator Emulator Emulator Emulator Emulator
-
- FLD1 26 55 42 75 481 422 1626
- FLDZ 21 53 36 70 480 416 1646
- FLDPI 26 55 42 76 486 443 1626
- FLDLG2 26 56 54 75 486 423 1626
- FLDL2T 26 55 43 75 486 440 1626
- FLDL2E 26 53 41 75 486 423 1626
- FLDLN2 26 55 41 75 486 441 1626
- FLD ST(0) 31 55 51 87 493 362 1851
- FST ST(1) 26 54 35 65 489 355 1931
- FSTP ST(0) 26 54 48 79 507 358 2115
- FSTP ST(1) 21 55 53 93 507 356 2116
- FLD ST(1) 26 55 54 97 493 362 1852
- FXCH ST(1) 21 57 60 102 497 486 2187
- FILD [Word] 58 90 87 139 667 712 2259
- FILD [DWord] 64 74 88 141 608 812 2164
- FILD [QWord] 74 93 117 194 652 707 2971
- FLD [DWord] 49 44 78 135 633 473 2077
- FLD [QWord] 54 57 84 137 641 524 2336
- FLD [TByte] 59 45 74 129 607 492 2063
- FBLD [TByte] 309 310 465 775 2019 1512 17827
- FIST [Word] 79 72 88 144 854 766 2418
- FIST [DWord] 84 80 87 142 865 518 2325
- FST [DWord] 89 85 89 148 686 441 2200
- FST [QWord] 99 92 91 145 703 516 2481
- FISTP [Word] 79 80 96 164 864 794 2620
- FISTP [DWord] 79 81 94 158 879 541 2523
- FISTP [QWord] 88 75 127 207 904 916 3226
- FSTP [DWord] 89 75 95 166 713 467 2400
- FSTP [QWord] 93 72 96 161 732 538 2678
- FSTP [TByte] 49 21 83 137 685 467 2124
- FBSTP [TByte] 528 472 696 1140 3305 1555 27013
- FINIT 11 10 116 200 742 641 1369
- FCLEX 11 10 29 48 440 323 912
- FCHS 21 54 34 53 460 354 1744
- FABS 21 54 31 47 456 349 1738
- FXAM 21 54 43 73 481 380 1551
- FTST 51 75 41 70 585 386 2721
- FSTENV 54 57 449 712 928 519 2104
- FLDENV 48 50 443 686 1125 450 1631
- FSAVE 214 244 2088 2932 1949 976 2749
- FRSTOR 209 227 1828 2795 2182 657 2225
- FSTSW [mem] 28 10 52 87 516 401 1189
- FSTSW AX N/A 55 317 423 451 N/A N/A
- FSTCW [mem] 28 10 50 74 506 359 1167
- FLDCW [mem] 19 47 54 90 524 437 1584
- FADD ST,ST(0) 86 128 107 170 643 706 2805
- FADD ST,ST(1) 85 116 130 211 707 808 3093
- FADD ST(1),ST 92 131 136 227 664 812 3146
- FADDP ST(1),ST 92 129 137 223 704 799 3143
- FADD [DWord] 105 122 164 266 874 969 3139
- FADD [QWord] 115 122 164 277 888 1021 3396
- FIADD [Word] 115 122 178 286 940 1211 3330
- FIADD [DWord] 125 122 200 285 882 1297 3215
- FSUB ST(1),ST 88 130 147 225 738 817 3156
- FSUBR ST(1),ST 96 132 135 219 740 868 3004
- FSUBRP ST(1),ST 99 132 146 242 733 805 3301
- FSUB [DWord] 119 122 173 268 918 1018 3127
- FSUB [QWord] 129 123 171 295 932 1070 3632
- FISUB [Word] 115 123 189 315 977 1081 3802
- FISUB [DWord] 125 125 198 335 940 980 4161
- FMUL ST,ST(1) 145 151 162 395 810 1368 3924
- FMUL ST(1),ST 145 151 162 392 817 1377 3962
- FMULP ST(1),ST 148 168 162 414 840 1365 4164
- FIMUL [Word] 132 151 227 480 1039 1517 4039
- FIMUL [DWord] 141 151 249 479 980 1643 3976
- FMUL [DWord] 125 123 204 439 948 1480 3445
- FMUL [QWord] 175 192 207 478 991 1602 4416
- FDIV ST,ST(0) 201 207 253 369 726 1536 9789
- FDIV ST,ST(1) 203 218 257 435 808 1658 10332
- FDIV ST(1),ST 207 214 251 440 825 1655 10342
- FDIVR ST(1),ST 201 206 260 448 819 1806 10213
- FDIVRP ST(1),ST 201 205 281 467 845 1803 10409
- FIDIV [Word] 237 227 315 487 980 1779 11225
- FIDIV [DWord] 246 227 326 510 944 1680 11572
- FDIV [DWord] 229 226 314 447 893 1722 10577
- FDIV [QWord] 236 227 320 522 993 1777 10829
- FSQRT (0.0) 21 57 55 95 512 382 1755
- FSQRT (1.0) 186 206 221 305 1106 2504 37836
- FSQRT (L2T) 186 207 218 312 1398 2467 37925
- FXTRACT (L2T) 51 56 89 182 726 571 3326
- FSCALE (PI,5) 41 56 66 121 817 443 3194
- FRNDINT (PI) 51 58 108 165 808 800 7092
- FPREM (99,PI) 81 131 231 356 1696 941 4098
- FPREM1(99,PI) N/A N/A 261 412 1625 N/A N/A
- FCOM 56 75 87 160 582 483 2799
- FCOMP 61 92 92 172 616 485 2983
- FCOMPP 61 90 115 184 661 476 3198
- FICOM [Word] 79 77 140 236 808 861 3654
- FICOM [DWord] 89 77 144 235 750 964 3684
- FCOM [DWord] 74 75 126 215 741 625 3643
- FCOM [QWord] 74 76 121 206 754 667 3771
- FSIN (0.0) N/A N/A 81 137 639 N/A N/A
- FSIN (1.0) N/A N/A 497 1004 4640 N/A N/A
- FSIN (PI) N/A N/A 245 375 2488 N/A N/A
- FSIN (LG2) N/A N/A 482 988 3911 N/A N/A
- FSIN (L2T) N/A N/A 525 1021 3767 N/A N/A
- FCOS (0.0) N/A N/A 107 178 740 N/A N/A
- FCOS (1.0) N/A N/A 523 1005 4777 N/A N/A
- FCOS (PI) N/A N/A 250 351 2557 N/A N/A
- FCOS (LG2) N/A N/A 452 980 4176 N/A N/A
- FCOS (L2T) N/A N/A 528 1012 3905 N/A N/A
- FSINCOS (0.0) N/A N/A 169 239 714 N/A N/A
- FSINCOS (1.0) N/A N/A 961 1850 6049 N/A N/A
- FSINCOS (PI) N/A N/A 327 458 4091 N/A N/A
- FSINCOS (LG2) N/A N/A 857 1535 5640 N/A N/A
- FSINCOS (L2T) N/A N/A 906 1578 5405 N/A N/A
- FPTAN (0.0) 41 58 58 103 752 8381 2324
- FPTAN (1.0) 581 582 655 1211 6366 10817 29824
- FPTAN (PI) 606 587 267 332 4388 12410 2300
- FPTAN (LG2) 516 513 411 903 5939 12502 26770
- FPTAN (L2T) 576 586 455 975 5723 12483 2301
- FPATAN (0.0) 41 55 138 223 616 1208 10578
- FPATAN (1.0) 736 736 121 200 1426 13446 34208
- FPATAN (PI) 206 207 576 1128 2835 13305 46903
- FPATAN (LG2) 756 736 556 1087 2490 13319 41312
- FPATAN (L2T) 206 204 559 1130 2922 13364 50149
- F2XM1 (0.0) 16 56 59 99 563 723 1722
- F2XM1 (LN2) 631 624 388 919 4178 11070 33823
- F2XM1 (LG2) 611 585 386 903 4798 11116 32163
- FYL2X (1.0) 56 57 76 143 961 1214 4327
- FYL2X (PI) 946 961 463 1032 8987 12858 40148
- FYL2X (LG2) 1081 1038 471 1060 8933 12748 46821
- FYL2X (L2T) 926 886 508 1114 8982 12712 38986
- FYL2XP1 (LG2) 1026 1037 564 1199 10485 11867 44708
-
-
- Clock-cycle timings for floating-point operations on Weitek coprocessors
- ------------------------------------------------------------------------
-
- The Weitek 3167 and 4167 coprocessors only implement the basic arithmetic
- functions (add, subtract, multiply, divide, square root) in hardware;
- transcendental functions are implemented by means of a software library
- supplied by Weitek which uses the basic hardware instructions to approximate
- the transcendental functions (using polynomial and rational approximations).
- The clock cycle timings for the transcendental functions are average values,
- since execution time can differ with the value of argument. The speed of
- transcendental functions for the 4167 is estimated based on the numbers in
- [31,33], from which this timing information has been extracted.
-
-
- Single-precision Double-precision
-
- 3167 4167 3167 4167
-
- ABS 3 2 3 2
- NEG 6 2 6 2
- ADD 6 2 6 2
- SUB 6 2 6 2
- SUBR 6 2 6 2
- MUL 6 2 10 3
- DIVR 38 17 66 31
- SQRT 60 17 118 31
- SIN 146 ~50 292 ~100
- COS 140 ~50 285 ~100
- TAN 188 ~60 340 ~110
- EXP 179 ~60 401 ~130
- LOG 171 ~60 365 ~120
- F->ASCII 1000 N/A 1700 N/A //
- ASCII->F 1100 N/A 1800 N/A //
-
- // rough average of the timings given for different numeric
- formats by Weitek. Note that these conversion routines
- do much more work than the FBLD and FBSTP instructions
- provided by the 80x87 coprocessors. FBLD and FBSTP are
- useful for conversion routines, but quite a bit of additional
- code is needed for this purpose.
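-
- To illustrate how a software library can build a transcendental function
- from nothing but multiplies and adds, here is a minimal sketch of a
- polynomial approximation evaluated with Horner's rule. It is not Weitek's
- library code: the Taylor coefficients, the restricted argument range, and
- the name poly_sin are assumptions chosen for clarity. A production library
- would use minimax coefficients and a separate argument reduction step.
-
```python
import math

# Taylor coefficients of sin(x)/x as a polynomial in x*x: 1, -1/3!, 1/5!, ...
# Eight terms, i.e. up to x**15 (illustrative choice, not Weitek's).
_COEFFS = [(-1) ** k / math.factorial(2 * k + 1) for k in range(8)]


def poly_sin(x):
    """Approximate sin(x) for |x| <= pi/2 using only multiply and add."""
    x2 = x * x
    acc = 0.0
    for c in reversed(_COEFFS):      # Horner's rule: one mul, one add per term
        acc = acc * x2 + c
    return x * acc


print(abs(poly_sin(1.0) - math.sin(1.0)) < 1e-10)
```
-
- Each loop iteration costs one multiply and one add, which is why the cycle
- count of such a routine is roughly a multiple of the basic ADD and MUL
- timings plus some overhead.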
-
-
-
- =============================================================================
- Accuracy of calculations performed by a coprocessor / The IEEETEST program
- =============================================================================
-
- Among the 80x87 coprocessors, the IEEE-754 Standard for Binary Floating-Point
- Arithmetic [10,11] was first fully implemented by Intel's 387 coprocessor [17].
- Among other things, this means that the add, subtract, multiply, divide,
- remainder, and square root operations always deliver the 'exact' result. By
- 'exact', the standard means that the coprocessor always delivers the machine
- number closest to the real result, which may not always be representable
- exactly in the available numeric format. The 80387 implements the single,
- double, and double extended formats as specified in the IEEE standard, as
- well as all functions required by it [17].
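-
- The notion of an 'exact' (correctly rounded) result can be checked in
- software. The sketch below computes the mathematically exact result with
- rational arithmetic and compares its nearest double with what the
- floating-point hardware returns. It assumes double precision and the
- round-to-nearest mode; the helper name is hypothetical.
-
```python
import operator
from fractions import Fraction

_OPS = {"+": operator.add, "-": operator.sub,
        "*": operator.mul, "/": operator.truediv}


def is_correctly_rounded(op, a, b):
    """True if `a op b` in doubles equals the double nearest the exact result."""
    f = _OPS[op]
    exact = f(Fraction(a), Fraction(b))   # exact rational arithmetic
    return float(exact) == f(a, b)        # float() rounds to the nearest double


print(is_correctly_rounded("/", 1.0, 3.0))
```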
-
- Note that earlier Intel coprocessors (the 8087 and the 80287) comply with a
- draft version of the standard that differs from the final version. These
- chips were developed before IEEE-754 was finally accepted in 1985. As with
- the 80387, the basic arithmetic in the 8087 and the 80287 is 'exact' in the
- sense that the computed result is always the machine number closest to the
- real result. However, there are some differences regarding certain operands
- like infinities, and some operations like the remainder are defined
- differently than in the final version of the standard.
-
- Some new instructions were introduced with the 80387, most notably the FSIN
- and FCOS operations. The argument range for some transcendental functions has
- also been extended [17]. Note that the IEEE-754 standard says nothing about
- the quality of the implementation of transcendental functions like sin, cos,
- tan, arctan, log. Intel uses a modified CORDIC [18,19] technique to compute
- the transcendental functions; Intel claims that maximum error in the 8087,
- 80287, and 80387 for all transcendental functions does not exceed two bits in
- the mantissa of the double extended format, which features 64 mantissa bits
- for an overall accuracy of approximately 19 decimal places [22,23]. This
- claim has been independently verified by a competing vendor [13]. This means
- that at least 62 of the 64 mantissa bits returned as a result by one of the
- transcendental function instructions are guaranteed to be correct.
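-
- The relation between mantissa bits and decimal places quoted above is
- simply bits * log10(2). A small sketch (the function name is my own):
-
```python
import math


def decimal_digits(mantissa_bits):
    """Number of decimal digits represented by a given count of mantissa bits."""
    return mantissa_bits * math.log10(2)


print(f"{decimal_digits(64):.2f}")   # full double extended mantissa
print(f"{decimal_digits(62):.2f}")   # two least significant bits in doubt
```
-
- This gives a little over 19 digits for 64 bits and a little under 19 digits
- when the last two bits cannot be relied upon.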
-
- The Weitek Abacus 3167 and 4167 coprocessors are 'mostly compatible' with
- IEEE-754 [31,32,33]. They support the single-precision and double-precision
- numeric formats described in the standard, as well as the four rounding modes
- required by it. However, due to Weitek's desire for extremely high-speed
- operation, some of the finer points of IEEE-754 have not been implemented.
- One of the most notable omissions is the missing support for denormal
- numbers; denormals are always flushed to zero on Weitek chips.
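-
- The effect of flushing denormals to zero can be modelled in a few lines.
- This is only a sketch for IEEE double precision on the host machine, not a
- description of the Weitek hardware; in particular, how the chips treat the
- sign of the flushed result is not modelled here.
-
```python
import sys

SMALLEST_NORMAL = sys.float_info.min   # 2**-1022 for IEEE double precision


def flush_to_zero(x):
    """Model flush-to-zero: any nonzero value below the normal range becomes 0."""
    if x != 0.0 and abs(x) < SMALLEST_NORMAL:
        return 0.0
    return x


tiny = SMALLEST_NORMAL / 4       # a denormal under gradual underflow
print(tiny > 0.0)                # IEEE-754 keeps a nonzero value here
print(flush_to_zero(tiny))       # the flush-to-zero model returns zero
```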
-
- The 387 clone manufacturers all claim 100% compatibility with Intel's 80387,
- so one would reasonably expect the same accuracy from their chips as from
- Intel's. For example, the packaging of the IIT 3C87 states that "...the
- requirements of ANSI/IEEE standards are fulfilled and exceeded". Cyrix states
- that their 83D87 complies fully with the IEEE-754 standard [12], and in fact
- ships diagnostic software with their coprocessors that includes the
- program IEEETEST. This program is based on the IEEE test vectors from the PhD
- thesis of Dr. Jerome T. Coonen [9]. A test using the IEEE test vectors has
- also been included in the RUNDIAG program on the Intel RapidCAD diagnostic
- disk. Rather than performing random tests, the test vectors check specific
- cases that may be hard to get right. Each test vector specifies the operation
- to be performed, the operands, precision and rounding mode to be used, and
- the result (including flags set) to be expected according to the IEEE-754
- standard.
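-
- The contents of a test vector, as described above, can be pictured as a
- small record plus a comparison step. The sketch below is hypothetical: the
- field names, the flag names, and the class IeeeVector are my own and do
- not reflect the actual file format used by IEEETEST. NaN results would
- need special handling that is omitted here.
-
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IeeeVector:
    operation: str             # e.g. "+" or "/"
    precision: str             # "S", "D" or "E"
    rounding: str              # rounding mode, e.g. "nearest"
    operands: tuple            # input operands
    expected: float            # result required by IEEE-754
    expected_flags: frozenset  # exception flags required, e.g. {"inexact"}


def classify(vector, result, flags):
    """Return the kinds of failure for one test: 'numeric' and/or 'flag'."""
    failures = set()
    if result != vector.expected:
        failures.add("numeric")    # wrong value computed
    if frozenset(flags) != vector.expected_flags:
        failures.add("flag")       # wrong status flags reported
    return failures


v = IeeeVector("/", "D", "nearest", (1.0, 3.0), 1.0 / 3.0,
               frozenset({"inexact"}))
print(classify(v, 1.0 / 3.0, {"inexact"}))   # empty set: test passed
```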
-
- I ran IEEETEST on all the available coprocessors/FPUs. The Intel 486, Intel
- RapidCAD, Intel 387, Intel 387DX, Cyrix 83D87, and the Cyrix 387+ passed with
- no errors. The ULSI 83C87 showed some minor flaws in the FCOM, FDIV, FMUL,
- and FSCALE operations, getting flag errors in about 1% of the tested cases,
- but no computational errors. The newer version ULSI DX/DLC had mismatches for
- the FDIV, FMUL, and FSCALE instructions, all of which were flag errors.
- For the IIT 3C87, the IEEETEST program showed flag *and* some computational
- errors (that is, wrong results) for all tested operations except FXTRACT
- and FCHS.
-
- The Intel 8087 and 80287 show numerous errors, but this is not surprising,
- since they comply not with IEEE-754 but with an earlier draft of that
- standard, and so do some things differently than the final version
- requires. In particular, the Intel 8087/80287 do not feature the IEEE-754
- compliant comparison (FUCOM) and remainder (FPREM1) instructions available
- on the Intel 80387 and newer coprocessors, so IEEETEST uses the
- non-compliant FCOM and FPREM instructions on these processors. Lack of an
- IEEE-754 compliant comparison instruction also causes a good deal of the
- errors in the 'Next After' test.
-
- Since IEEETEST is written in Turbo Pascal, it was recompiled with the $E+
- switch to enable use of the coprocessor emulator built into the TP 6.0
- library. Using the emulator, IEEETEST aborted in the following tests with a
- division by zero error: 'Comparison', 'Division', 'Next After'. These tests
- were removed from the suite and the remaining tests were performed.
-
- The public domain emulator EM87 could be tested, but hung in the last test,
- which checks the implementation of the remainder operation. This problem
- occurred because EM87 incorrectly identifies itself as a 387 type
- coprocessor when run on an 80386. This causes the 387-specific FUCOM
- instruction to be used in the 'Comparison' and 'Next After' tests and the
- FPREM1 instruction to be used in the 'Remainder' test. Apparently EM87 is
- not able to emulate these instructions and therefore crashes upon trying to
- execute them. It is interesting to note that in all other tests the error
- profile of EM87 matches exactly that of the Intel 80287, so it can be
- assumed that EM87 is a very good emulation of the 80287 when run on the
- 80286.
-
- The Franke387 V2.4 emulator hangs in the following tests performed by
- IEEETEST: 'Division', 'Multiplication', 'Scalb', 'Remainder'. The cause of
- these failures is unknown.
-
-
- This explanatory text is printed at the start of the IEEETEST program:
-
- JT Coonen's 1984 UC Berkeley Ph.D. thesis centers around his activities
- as a member of the floating-point working group that defined the IEEE
- 754-1985 Standard for Binary Floating-Point Arithmetic. Appendix C of
- his thesis presents FPTEST, a Pascal program written by J Thomas and JT
- Coonen. IEEETEST is a port of FPTEST and runs on PCs whose math
- coprocessor accepts 80387-compatible floating-point instructions.
-
- IEEETEST reads test vectors from the file TESTVECS and compares the
- answer returned by the math coprocessor with the answer listed in the
- test vector. If these answers differ an 'F' is displayed, otherwise a
- '.'is displayed. Answers can differ due to two types of failures:
- numeric failures or flag failures. Numeric failures occur when the
- computed answer has the wrong value. Flag failures occur when the status
- (invalid operation, divide by zero, underflow, overflow, inexact) is
- incorrectly identified.
-
- TESTVECS is the concatenation of unmodified versions of all the test
- vectors distributed by UC Berkeley. The test data base is copyrighted by
- UC Berkeley (1985) and is being distributed with their permission.
- FPTEST and the test data base can be obtained by asking for 'IEEE-754
- Test Vector' from UC Berkeley, Electrical Engineering and Computer
- Science, Industrial Liaison Program, 479 Corey Hall, Berkeley, CA, 94720
- (415)643-6687.
-
- The initial version of this test data base for the proposed IEEE 754
- binary floating-point standard (draft 8.0) was developed for Zilog, Inc.
- and was donated to the floating-point working group for dissemination.
- Errors in or additions to the distributed data base should be reported
- to the agency of distribution, with copies to Zilog, Inc., 1315 Dell
- Avenue, Campbell, CA, 95008.
-
-
- IEEETEST output for Intel 80387, Intel 387DX (manufactured 91/49), Intel 486,
- C&T 38700 (manufactured 92/19), Cyrix 83D87, Cyrix 387+ (manufactured 92/11),
- Intel RapidCAD (manufactured 92/05), and Intel Pentium:
- ----------------------------------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 3528 0 | 0 0 0 | 0 0 0
- Comparison C | 4320 0 | 0 0 0 | 0 0 0
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 4311 0 | 0 0 0 | 0 0 0
- Fraction Part F | 624 0 | 0 0 0 | 0 0 0
- Logb L | 960 0 | 0 0 0 | 0 0 0
- Multiplication * | 3978 0 | 0 0 0 | 0 0 0
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 2832 0 | 0 0 0 | 0 0 0
- Round to Integer I | 558 0 | 0 0 0 | 0 0 0
- Scalb S | 948 0 | 0 0 0 | 0 0 0
- Square Root V | 744 0 | 0 0 0 | 0 0 0
- Subtraction - | 3528 0 | 0 0 0 | 0 0 0
- Remainder % | 2984 0 | 0 0 0 | 0 0 0
- Totals | 31235 0 |
-
-
- IEEETEST output for ULSI 83C87 (manufactured 91/48):
- ----------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 3528 0 | 0 0 0 | 0 0 0
- Comparison C | 4312 8 | 0 0 0 | 0 0 8
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 4250 61 | 0 0 0 | 28 28 5
- Fraction Part F | 624 0 | 0 0 0 | 0 0 0
- Logb L | 960 0 | 0 0 0 | 0 0 0
- Multiplication * | 3936 42 | 0 0 0 | 19 19 4
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 2828 4 | 0 0 0 | 0 0 4
- Round to Integer I | 558 0 | 0 0 0 | 0 0 0
- Scalb S | 930 18 | 0 0 0 | 6 6 6
- Square Root V | 744 0 | 0 0 0 | 0 0 0
- Subtraction - | 3528 0 | 0 0 0 | 0 0 0
- Remainder % | 2984 0 | 0 0 0 | 0 0 0
- Totals | 31102 133 |
-
-
- IEEETEST output for ULSI 83S87 (manufactured 92/17) (data kindly supplied
- by Bengt Ask, f89ba@efd.lth.se) and for ULSI DX/DLC (manufactured 94/15):
- -------------------------------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 3528 0 | 0 0 0 | 0 0 0
- Comparison C | 4320 0 | 0 0 0 | 0 0 0
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 4296 15 | 0 0 0 | 5 5 5
- Fraction Part F | 624 0 | 0 0 0 | 0 0 0
- Logb L | 960 0 | 0 0 0 | 0 0 0
- Multiplication * | 3966 12 | 0 0 0 | 4 4 4
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 2828 4 | 0 0 0 | 0 0 4
- Round to Integer I | 558 0 | 0 0 0 | 0 0 0
- Scalb S | 930 18 | 0 0 0 | 6 6 6
- Square Root V | 744 0 | 0 0 0 | 0 0 0
- Subtraction - | 3528 0 | 0 0 0 | 0 0 0
- Remainder % | 2984 0 | 0 0 0 | 0 0 0
- Totals | 31102 45 |
-
-
- IEEETEST output for IIT 3C87 (manufactured 92/20):
- --------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 200 16 | 0 0 16 | 0 0 0
- Addition + | 3336 192 | 0 0 128 | 0 0 96
- Comparison C | 4224 96 | 0 0 96 | 0 0 0
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 4159 152 | 0 0 124 | 0 0 116
- Fraction Part F | 600 24 | 0 0 24 | 0 0 24
- Logb L | 960 0 | 0 0 0 | 0 0 0
- Multiplication * | 3702 276 | 0 0 248 | 0 0 100
- Negation - | 200 16 | 0 0 16 | 0 0 0
- Next After N | 2248 584 | 0 0 584 | 0 0 168
- Round to Integer I | 542 16 | 0 0 4 | 0 0 16
- Scalb S | 874 74 | 5 5 44 | 8 8 20
- Square Root V | 688 56 | 0 0 56 | 0 0 56
- Subtraction - | 3336 192 | 0 0 128 | 0 0 96
- Remainder % | 2844 140 | 0 0 140 | 0 0 116
- Totals | 29401 1834 |
-
-
- IEEETEST output for Intel 80287 run with a 80386 CPU and Intel 8087:
- --------------------------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 2886 642 | 16 16 112 | 174 174 174
- Comparison C | 3612 708 | 136 136 136 | 228 228 228
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 3777 534 | 18 18 37 | 169 169 165
- Fraction Part F | 552 72 | 24 24 24 | 24 24 24
- Logb L | 900 60 | 12 12 12 | 20 20 20
- Multiplication * | 2944 1034 | 105 105 197 | 303 303 231
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 516 2316 | 168 168 332 | 764 764 764
- Round to Integer I | 546 12 | 0 0 0 | 4 4 4
- Scalb S | 663 285 | 45 43 26 | 102 98 46
- Square Root V | 720 24 | 4 4 4 | 8 8 8
- Subtraction - | 2886 642 | 16 16 112 | 174 174 174
- Remainder % | 1490 1494 | 432 432 288 | 342 342 230
- Totals | 23412 7823 |
-
-
- IEEETEST output for EM87 coprocessor emulator run on an Intel 386 CPU:
- ----------------------------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 216 0 | 0 0 0 | 0 0 0
- Addition + | 2886 642 | 16 16 112 | 174 174 174
- Comparison C | 0 4320 | 1324 1324 1324 |1332 1332 1332
- Copy Sign @ | 1488 0 | 0 0 0 | 0 0 0
- Division / | 3777 534 | 18 18 37 | 169 169 165
- Fraction Part F | 552 72 | 24 24 24 | 24 24 24
- Logb L | 900 60 | 12 12 12 | 20 20 20
- Multiplication * | 2944 1034 | 105 105 197 | 303 303 231
- Negation - | 216 0 | 0 0 0 | 0 0 0
- Next After N | 348 2484 | 768 768 768 | 504 504 526
- Round to Integer I | 546 12 | 0 0 0 | 4 4 4
- Scalb S | 663 285 | 45 43 26 | 102 98 46
- Square Root V | 720 24 | 4 4 4 | 8 8 8
- Subtraction - | 2886 642 | 16 16 112 | 174 174 174
- Remainder % | ######## not run since machine hangs #######
-
-
- IEEETEST output for Franke387 2.4 coprocessor emulator run on an Intel 386:
- ---------------------------------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 152 64 | 0 0 8 | 24 24 8
- Addition + | 1587 1941 | 178 178 722 | 508 508 616
- Comparison C | 3696 624 | 208 208 208 | 4 4 108
- Copy Sign @ | 1200 288 | 0 0 0 | 144 144 0
- Division / | ######## not run since machine hangs #######
- Fraction Part F | 624 0 | 0 0 0 | 0 0 0
- Logb L | 908 52 | 0 0 16 | 16 16 4
- Multiplication * | ######## not run since machine hangs #######
- Negation - | 152 64 | 0 0 8 | 24 24 8
- Next After N | 1404 1420 | 404 404 596 | 80 80 172
- Round to Integer I | 514 44 | 4 4 20 | 8 8 16
- Scalb S | ######## not run since machine hangs #######
- Square Root V | 569 175 | 14 31 54 | 28 48 72
- Subtraction - | 1827 1701 | 98 98 642 | 452 452 576
- Remainder % | ######## not run since machine hangs #######
-
-
- IEEETEST output for Q387 3.63 coprocessor emulator run on an Intel 386:
- -----------------------------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 152 64 | 0 0 16 | 24 24 0
- Addition + | 2320 1208 | 171 173 332 | 364 364 196
- Comparison C | 3924 396 | 0 0 96 | 100 100 100
- Copy Sign @ | 1200 288 | 24 24 0 | 144 144 0
- Division / | 3699 612 | 63 104 273 | 125 125 161
- Fraction Part F | 600 24 | 0 0 24 | 0 0 24
- Logb L | 924 36 | 4 4 4 | 8 8 8
- Multiplication * | 2747 1231 | 158 169 376 | 350 376 246
- Negation - | 152 64 | 8 8 16 | 24 24 0
- Next After N | 1220 1612 | 364 364 584 | 108 108 268
- Round to Integer I | 344 214 | 50 50 54 | 48 72 88
- Scalb S | 452 496 | 80 76 97 | 168 160 96
- Square Root V | 199 545 | 120 161 143 | 164 164 164
- Subtraction - | 2320 1208 | 171 173 332 | 364 364 196
- Remainder % | 2028 956 | 276 276 276 | 160 164 136
- Totals | 22281 8954 |
-
-
- IEEETEST output for TP 6.0 coprocessor emulator:
- ------------------------------------------------
-
- IEEE-754 Test Vector Precisions: S=Single D=Double E=Double Extended
- | TESTS | numeric TYPE OF FAILURE flag
- Operation Code | Passed Failed | S D E | S D E
- ----------------------------------------------------------------------
- Absolute Value A | 168 48 | 16 16 16 | 16 8 0
- Addition + | 1877 1651 | 294 290 336 | 496 456 416
- Comparison C | ## not run - program aborts with div-by-0 ##
- Copy Sign @ | 1392 96 | 48 48 0 | 48 0 0
- Division / | ## not run - program aborts with div-by-0 ##
- Fraction Part F | 588 36 | 12 0 24 | 0 0 0
- Logb L | 888 72 | 24 24 24 | 12 12 12
- Multiplication * | 2148 1830 | 332 310 528 | 520 360 352
- Negation - | 160 48 | 16 16 16 | 16 8 0
- Next After N | ## not run - program aborts with div-by-0 ##
- Round to Integer I | 318 240 | 0 0 4 | 80 80 80
- Scalb S | 564 384 | 108 100 76 | 112 88 56
- Square Root V | 180 564 | 143 157 169 | 72 72 128
- Subtraction - | 1877 1651 | 294 290 336 | 496 456 416
- Remainder % | 1072 1912 | 652 672 524 | 336 288 216
-
-
-
-
- Additional accuracy and compatibility tests
- -------------------------------------------
-
- To complement the checks done by IEEETEST, I also wrote three short Turbo
- Pascal 6.0 programs (DENORMTS, RCTRL, and PCTRL) that test the following
- coprocessor functions:
-
- 1. support for denormals in all precisions (single, double, extended)
- 2. support for the four IEEE rounding modes (up, down, nearest, chop)
- 3. support for precision control
-
- Note that passing all tests is required for IEEE conformance, as well as for
- 100% compatibility with Intel's coprocessors. Precision control forces the
- results of the FADD, FSUB, FMUL, FDIV, and FSQRT instructions to be rounded
- to the specified precision (single, double, double extended). This feature
- is provided to obtain compatibility with certain programming languages [17].
- By specifying lower precision, one effectively nullifies the advantages of
- extended precision intermediate results.
-
- The IEEE-754 standard for floating-point arithmetic demands that processors
- and floating-point packages that cannot store the results of operations
- *directly* to single and double precision locations must provide precision
- control. The programs that test precision control and rounding control are
- designed to return a different result for each of the modes for the same
- sequence of operations.
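-
- The effect of the two control fields can be sketched in software. The
- following Python fragment is only an illustration (it is not the PCTRL or
- RCTRL source from appendix A, and the helper name round_to_bits is made up
- for this example). It rounds an exact rational value to 24, 53, or 64
- significant bits under each of the four rounding modes and ignores exponent
- range limits:

```python
from fractions import Fraction
import math

def round_to_bits(value, bits, mode="nearest"):
    """Round an exact Fraction to `bits` significant bits (exponent range unlimited)."""
    if value == 0:
        return Fraction(0)
    negative = value < 0
    mag = abs(value)
    # Find e such that 2**e <= mag < 2**(e+1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)      # weight of the last kept bit
    scaled = mag / ulp
    low = math.floor(scaled)                 # magnitude truncated to `bits` bits
    rem = scaled - low                       # discarded part, in ULPs
    if rem == 0 or mode == "chop":
        n = low
    elif mode == "down":                     # toward minus infinity
        n = low + 1 if negative else low
    elif mode == "up":                       # toward plus infinity
        n = low if negative else low + 1
    elif mode == "nearest":                  # ties go to the even mantissa
        half = Fraction(1, 2)
        n = low + 1 if rem > half or (rem == half and low % 2 == 1) else low
    else:
        raise ValueError("unknown rounding mode: " + mode)
    result = n * ulp
    return -result if negative else result

third = Fraction(1, 3)
for name, bits in (("SINGLE", 24), ("DOUBLE", 53), ("EXTENDED", 64)):
    print(name, round_to_bits(third, bits))
for mode in ("nearest", "down", "up", "chop"):
    print(mode, round_to_bits(-third, 24, mode))
```

- A single positive operand cannot distinguish "down" from "chop", and a
- single negative operand cannot distinguish "up" from "chop", which is why a
- test that must tell all four modes apart needs operands of both signs.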
-
- The source code of the programs can be found in appendix A. The Intel 8087
- and 80287 were not tested with DENORMTS since Turbo Pascal does not support
- extended precision denormals on 8087/80287 processors, so the denormal test
- fails anyway. (The 8087 and 287 pass the RCTRL and PCTRL tests without error,
- however.)
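-
- As a rough illustration of what "denormal" means at the bit level (this is
- not the DENORMTS source, and the function names are invented for this
- sketch), the following Python fragment checks whether a value stored in
- IEEE single or double format has an all-zero exponent field and a nonzero
- mantissa. Python has no 80-bit type, so the double extended case is not
- covered:

```python
import struct

def is_denormal_single(x):
    """True if x, stored as an IEEE single, is a denormal."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return exponent == 0 and mantissa != 0

def is_denormal_double(x):
    """True if x, stored as an IEEE double, is a denormal."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    exponent = (bits >> 52) & 0x7FF
    mantissa = bits & 0xFFFFFFFFFFFFF
    return exponent == 0 and mantissa != 0

# Magnitudes of the kind printed by the denormal test in this section.
print(is_denormal_single(4.60943116855005e-41))
print(is_denormal_double(8.75e-311))
print(is_denormal_double(1.0))
```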
-
-
- Test Results for the Intel 387, Intel 387DX, Intel 486, Intel RapidCAD,
- Cyrix 83D87, Cyrix 387+, C&T 38700, and the EM87 emulator (on an 80386 system):
- -------------------------------------------------------------------------------
-
- Precision Control SINGLE 1.13311278820037842E+0000
- DOUBLE 1.23456789006442125E+0000
- EXTENDED 1.23456789012337585E+0000
-
- Rounding Control NEAREST -1.23427629010100635E+0100
- DOWN -1.23427623555772409E+0100
- UP -1.23457760966801097E+0100
- CHOP -1.23397493540770643E+0100
-
- Denormal support
-
- SINGLE denormals supported
- SINGLE denormal prints as: 4.60943116855005E-0041
- Denormal should be printed as 4.60943...E-0041
-
- DOUBLE denormals supported
- DOUBLE denormal prints as: 8.75000000000016E-0311
- Denormal should be printed as 8.75...E-0311
-
- EXTENDED denormals supported
- EXTENDED denormal prints as: 1.31640625000000E-4934
- Denormal should be printed as 1.3164...E-4934
-
-
- Results for the ULSI 83C87:
- ---------------------------
-
- Precision Control SINGLE 1.23456789012337585E+0000
- DOUBLE 1.23456789012337585E+0000
- EXTENDED 1.23456789012337585E+0000
-
- Rounding Control NEAREST -1.23427629010100635E+0100
- DOWN -1.23427623555772409E+0100
- UP -1.23457760966801097E+0100
- CHOP -1.23397493540770643E+0100
-
- Denormal support
-
- SINGLE denormals supported
- SINGLE denormal prints as: 4.60943116855005E-0041
- Denormal should be printed as 4.60943...E-0041
-
- DOUBLE denormals supported
- DOUBLE denormal prints as: 8.75000000000016E-0311
- Denormal should be printed as 8.75...E-0311
-
- EXTENDED denormals supported
- EXTENDED denormal prints as: 1.31640625000000E-4934
- Denormal should be printed as 1.3164...E-4934
-
-
- Results for the IIT 3C87:
- -------------------------
-
- Precision Control SINGLE 1.13311278820037842E+0000
- DOUBLE 1.23456789006442125E+0000
- EXTENDED 1.23456789012337585E+0000
-
- Rounding Control NEAREST -1.23427629010100635E+0100
- DOWN -1.23427623555772409E+0100
- UP -1.23457760966801097E+0100
- CHOP -1.23397493540770643E+0100
-
- Denormal support
-
- SINGLE denormals supported
- SINGLE denormal prints as: 4.60943116855005E-0041
- Denormal should be printed as 4.60943...E-0041
-
- DOUBLE denormals supported
- DOUBLE denormal prints as: 8.75000000000016E-0311
- Denormal should be printed as 8.75...E-0311
-
- EXTENDED denormals not supported
-
-
- Results for the Turbo Pascal 6.0 coprocessor emulator:
- ------------------------------------------------------
-
- Precision Control SINGLE 1.23456789012351396E+0000
- DOUBLE 1.23456789012351396E+0000
- EXTENDED 1.23456789012351396E+0000
-
- Rounding Control NEAREST -1.23457766383395931E+0100
- DOWN -1.23457766383395931E+0100
- UP -1.23457766383395931E+0100
- CHOP -1.23457766383395931E+0100
-
- Denormal support
-
- SINGLE denormals not supported
- DOUBLE denormals not supported
- EXTENDED denormals not supported
-
-
- Results for the Q387 3.63 coprocessor emulator:
- -----------------------------------------------
-
- Precision Control SINGLE 1.23456789012337585E+0000
- DOUBLE 1.23456789012337585E+0000
- EXTENDED 1.23456789012337585E+0000
-
- Rounding Control NEAREST -1.23427629010100635E+0100
- DOWN -1.23427629010100635E+0100
- UP -1.23427629010100635E+0100
- CHOP -1.23427629010100635E+0100
-
- Denormal support
-
- SINGLE denormals supported
- SINGLE denormal prints as: 4.60929103870362E-0041
- Denormal should be printed as 4.60943...E-0041
-
- DOUBLE denormals supported
- DOUBLE denormal prints as: 8.74999999999966E-0311
- Denormal should be printed as 8.75...E-0311
-
- EXTENDED denormals not supported
-
-
- The test results show that the IIT 3C87 does not conform to the IEEE-754
- floating-point standard in that it does not support denormals in double
- extended precision. The ULSI 83C87 does not conform to that standard in that
- it does not support precision control, but uses double extended precision
- for all operations. The TP 6.0 emulator supports neither precision control
- nor rounding control, and it does not support denormals in any precision.
- In addition, the basic arithmetic operations of the TP 6.0 emulator do not
- seem to conform to the IEEE standard, as the results of the test programs
- differ from the results computed by a coprocessor in any mode.
-
- The Q387 3.63 emulator also lacks precision control and rounding control.
- Its results in the precision control test are equal to those of a math
- coprocessor in EXTENDED precision mode, and its results in the rounding
- control test are equal to those of a math coprocessor in "round to nearest"
- mode. The denormal support test indicates that Q387 has support for single
- and double precision denormals, but not for double extended precision
- denormals. However, its denormal results differ from the results of math
- coprocessors that support denormals. Taken together, the results of the
- three programs indicate that Q387 3.63 correctly implements double extended
- precision arithmetic, except for denormals. Q387 has obviously been improved
- over the previously tested version 3.0, in which the results from the PCTRL
- and RCTRL programs would not match those of any coprocessor. Also, the
- number of failures in the IEEETEST program has dropped significantly from
- version 3.0 (20743 failures) to version 3.63 (8954 failures).
-
-
-
- ================================================
- Accuracy of transcendental function calculations
- ================================================
-
- With regard to the accuracy of transcendental functions, Cyrix claims that
- the relative error of the transcendental functions on its 83D87 coprocessor
- never exceeds 0.5 ULP of the double extended format [13] (ULP = Unit in the
- Last Place, numeric weight of the least significant mantissa bit). This means
- that the maximum relative error is below 2**-64, while Intel's published
- error limit for the 80387 is 2**-62. While Intel uses a modified CORDIC
- algorithm [18,19] to compute the transcendental functions, Cyrix uses
- rational approximations that utilize their chip's very fast array multiplier.
- (For an explanation of why this approach is superior to CORDIC with today's
- technology, see [61].) Also, Cyrix uses an internal 75 bit data path for the
- mantissa [15], so intermediate computations in the generation of
- transcendental function values will enjoy some additional accuracy over the
- 64 bits provided by the double extended format. Using 75 mantissa bits also
- provides an advantage over other coprocessors like the Intel 387DX and ULSI
- 83C87 which use only a 68 bit mantissa data path [58,59].
-
- Note that a maximum relative error of 0.5 ULP for the Cyrix coprocessor does
- not mean that it returns the 'exact' result (the machine number closest to
- the infinitely precise result) all the time. Consider the case where the
- infinitely precise result of a transcendental function falls nearly halfway
- between two machine numbers. A relative error of 0.5 ULP can cause the result
- to be either of the numbers after rounding, depending on the direction of the
- error. But the 83D87 should deliver results that never differ from the
- 'exact' result by more than one ULP. Also note that the claim of the relative
- error being below 0.5 ULPs is slightly exaggerated; 0.6 ULPs would be a more
- realistic error limit. Imagine that the infinitely precise result for some
- argument to a transcendental function is xxx..xxx1001... (where xxx..xxx
- represents the first 64 bits of the result), but that the coprocessor
- computes the result as xxx..xxx0111 and then rounds this down to
- xxx..xxx0000. The four trailing bits count sixteenths of an ULP, so the
- relative error is (1001b-0000b)/10000b = 9/16 = 0.5625 ULPs.
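-
- The arithmetic of this example can be retraced with exact fractions. The
- sketch below is only an illustration; it assumes that the four bits shown
- after the 64-bit mantissa are all that matters, so that one ULP equals
- sixteen units of the trailing bit pattern. All names are made up for the
- example:

```python
from fractions import Fraction

TAIL_BITS = 4                      # bits shown beyond the 64-bit mantissa
ONE_ULP = 2 ** TAIL_BITS           # one ULP equals 10000b units of the tail

exact_tail = 0b1001                # infinitely precise result: xxx..xxx1001...
computed_tail = 0b0111             # value computed internally: xxx..xxx0111

# Round to nearest: 0111b lies below the halfway point 1000b, so the tail
# is dropped and the 64-bit mantissa stays at xxx..xxx.
returned_tail = 0 if computed_tail < ONE_ULP // 2 else ONE_ULP

internal_error = Fraction(abs(exact_tail - computed_tail), ONE_ULP)
final_error = Fraction(abs(exact_tail - returned_tail), ONE_ULP)
print(internal_error, final_error)   # prints "1/8 9/16"
```

- An internal error of only 1/8 ULP thus turns into a final error of 9/16 ULP
- because the internal value fell on the other side of the rounding boundary.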
-
- I tested some of the transcendental functions of the Cyrix 387+ and found the
- relative error to be always below 0.6 ULPs. Cyrix also claims that its
- transcendental functions satisfy the monotonicity criterion [13]. None of
- the competitors makes this claim, which does not mean that the transcendental
- functions on the other 387-compatibles are not monotonic, too.
- Monotonicity means that for all x1 > x2, it always follows that f(x1) >=
- f(x2) for an increasing function like sin on [0..pi/4]. Likewise, for a
- decreasing function like cos on [0..pi/4], for all x1 > x2, it follows that
- f(x1) <= f(x2).
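-
- A spot check of the monotonicity criterion could look like the Python
- sketch below. It is not a program used for this article; the function names
- are hypothetical, and it tests the host's double precision math library
- rather than a coprocessor's double extended results. Violations are most
- likely between neighbouring machine numbers, so each sample argument is
- compared with its immediate successor:

```python
import math
import struct

def next_up(x):
    """Next larger double after a non-negative finite double x."""
    bits = struct.unpack("<q", struct.pack("<d", x))[0]
    return struct.unpack("<d", struct.pack("<q", bits + 1))[0]

def monotonicity_violations(f, lo, hi, samples=1000, increasing=True):
    """Count samples x in [lo, hi) where f(x) and f(next_up(x)) are misordered.

    Assumes 0 <= lo < hi. A count of zero does not prove monotonicity,
    since only a sample of the arguments is examined.
    """
    violations = 0
    for i in range(samples):
        x = lo + (hi - lo) * i / samples
        a, b = f(x), f(next_up(x))
        misordered = (b < a) if increasing else (b > a)
        if misordered:
            violations += 1
    return violations

print(monotonicity_violations(math.sin, 0.0, math.pi / 4))
print(monotonicity_violations(math.cos, 0.0, math.pi / 4, increasing=False))
```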
-
- As previously noted, the Weitek Abacus 3167 and 4167 coprocessors implement
- only the basic arithmetic operations (add, subtract, negate, multiply,
- divide, square root) in hardware. Transcendental functions are performed via
- a software library provided by Weitek. For these library functions Weitek
- claims a maximum relative error of 5 ULPs [31,33]. This means that the last
- three bits in the mantissa of a double-precision result can be wrong. Note
- that the Intel 387 and compatible math coprocessors generate the
- transcendental functions with a small relative error with regard to the
- *double extended precision* format. Thus, when rounded to double precision,
- their function values are nearly always 'exact'. The problem of 'double
- rounding' (rounding once to double extended and again to double precision)
- prevents them from being 'exact' in 100% of all cases. 387-type
- coprocessors in general have superior accuracy when compared with Weitek's
- coprocessors.
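-
- The 'double rounding' effect can be demonstrated with exact arithmetic. In
- the following Python sketch (an illustration only; the helper name and the
- chosen value are assumptions made for this example), a value just above the
- midpoint between two double precision numbers is rounded to 53 bits
- directly, and alternatively first to 64 bits and then to 53 bits:

```python
from fractions import Fraction
import math

def round_nearest_even(value, bits):
    """Round a positive Fraction to `bits` significant bits, ties to even."""
    # Find e such that 2**e <= value < 2**(e+1).
    e = value.numerator.bit_length() - value.denominator.bit_length()
    if Fraction(2) ** e > value:
        e -= 1
    ulp = Fraction(2) ** (e - bits + 1)
    scaled = value / ulp
    low = math.floor(scaled)
    rem = scaled - low
    half = Fraction(1, 2)
    if rem > half or (rem == half and low % 2 == 1):
        low += 1
    return low * ulp

# Slightly above the midpoint between the doubles 1 and 1 + 2**-52.
exact = 1 + Fraction(1, 2**53) + Fraction(1, 2**70)

direct = round_nearest_even(exact, 53)
via_extended = round_nearest_even(round_nearest_even(exact, 64), 53)

print(direct == 1 + Fraction(1, 2**52))   # prints True
print(via_extended == 1)                  # prints True
```

- The first rounding to 64 bits discards the small excess over the midpoint,
- so the second rounding sees an exact tie and resolves it to the even
- neighbour, which is not the double closest to the original value.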
-
- The test diskette distributed with early versions of the Cyrix 83D87
- contained a program (TRANCK) that checks the accuracy of the transcendental
- functions in the coprocessor against a more precise software arithmetic [16].
- I used this program to compare the accuracy of the transcendental functions
- on those 287/387/486 coprocessors/FPUs available to me. As TRANCK will not
- accept negative numbers as interval limits, I tested each function on an
- interval along the positive x-axis. The functions tested were F2XM1 (2**x-1),
- FSIN (sine), FCOS (cosine), FPTAN (tangent), FPATAN (arctangent), FYL2X (y *
- log2 (x)), and FYL2XP1 (y * log2 (x+1)). These are all the transcendental
- functions implemented on the 80387. Note that the square root (FSQRT) is
- *not* a transcendental function. For each function, 100,000 arguments were
- evaluated, with the arguments uniformly distributed within the interval
- tested.
-
- The EM87 emulator could not be checked with TRANCK, since the multiple
- precision package in TRANCK would always return with an error message
- immediately. However, the Franke387 emulator could be tested.
-
-
- In the test results below, the following statistics are detailed:
-
- %wrong is the percentage of results that differ from the 'exact'
- result (infinitely precise result rounded to 64 bits)
- ULP_hi is the number of results where the returned result was
- greater than the 'exact' (correctly rounded) result by
- one ULP (the numeric weight of the last mantissa bit,
- 2**-63 to 2**-64 depending on the size of the number).
- ULPs_hi is the number of results where the returned result was
- greater than the 'exact' result by two or more ULPs.
- ULP_lo is the number of results where the returned result was
- smaller than the 'exact' (correctly rounded) result by
- one ULP (the numeric weight of the last mantissa bit,
- 2**-63 to 2**-64 depending on the size of the number).
- ULPs_lo is the number of results where the returned result was
- smaller than the 'exact' result by two or more ULPs.
- max ULP err is the maximum deviation of a returned result from the
- 'exact' answer expressed in ULPs.
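-
- To make the statistics concrete, here is a small Python sketch that
- computes the same kind of tallies. It is not TRANCK and makes several
- assumptions: it works in double precision instead of double extended
- precision, it tests the host's math library sine rather than a coprocessor,
- and it obtains its reference values from an exactly summed Taylor series
- that is rounded once (the truncation error of the series is far smaller
- than a double precision ULP, but correct rounding is not strictly
- guaranteed). The names reference_sin, ordinal, and tally are invented for
- this example:

```python
import math
import random
import struct
from fractions import Fraction

def reference_sin(x, terms=15):
    """sin(x) for a double 0 <= x <= pi/4, summed exactly and rounded once."""
    xf = Fraction(x)
    term = total = xf
    for k in range(1, terms):
        term = -term * xf * xf / ((2 * k) * (2 * k + 1))
        total += term
    return float(total)    # Fraction-to-float conversion rounds to nearest

def ordinal(x):
    """Integer index of a non-negative double; neighbours differ by 1."""
    return struct.unpack("<q", struct.pack("<d", x))[0]

def tally(pairs):
    """Summarize (returned, exact) pairs in the style of the columns above."""
    stats = {"ULP_hi": 0, "ULPs_hi": 0, "ULP_lo": 0, "ULPs_lo": 0,
             "max_ULP_err": 0}
    count = wrong = 0
    for returned, exact in pairs:
        count += 1
        diff = ordinal(returned) - ordinal(exact)   # signed distance in ULPs
        if diff == 0:
            continue
        wrong += 1
        if diff == 1:
            stats["ULP_hi"] += 1
        elif diff > 1:
            stats["ULPs_hi"] += 1
        elif diff == -1:
            stats["ULP_lo"] += 1
        else:
            stats["ULPs_lo"] += 1
        stats["max_ULP_err"] = max(stats["max_ULP_err"], abs(diff))
    stats["pct_wrong"] = 100.0 * wrong / count
    return stats

if __name__ == "__main__":
    rng = random.Random(1)
    args = [rng.uniform(0.0, math.pi / 4) for _ in range(1000)]
    print(tally((math.sin(x), reference_sin(x)) for x in args))
```

- The output depends on the math library of the host system, so no expected
- values are given here.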
-
- Test results for accuracy of transcendental functions for double extended
- precision as returned by the program TRANCK. 100,000 trials per function:
-
- Franke387 V2.4 emulator
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 39.042 25301 708 13029 4 2
- COS 0,pi/4 75.714 49827 25887 0 0 3
- TAN 0,pi/4 76.976 14230 10029 24323 28394 9
- ATAN 0,1 55.826 26028 1529 24044 4225 4
- 2XM1 0,0.5 96.717 0 0 47910 48807 5
- YL2XP1 0,sqrt(2)-1 93.007 578 9 27416 65004 8
- YL2X 0.1,10 62.252 16817 4712 37082 3641 2953
-
-
- Microsoft's coprocessor emulator
- (part of MS-C and MS-Fortran libraries)
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 N/A N/A N/A N/A N/A N/A
- COS 0,pi/4 N/A N/A N/A N/A N/A N/A
- TAN 0,pi/4 40.828 27764 1520 11445 99 2
- ATAN 0,1 32.307 18893 485 12530 299 2
- 2XM1 0,0.5 52.163 8585 189 37745 5644 3
- YL2XP1 0,sqrt(2)-1 88.801 4714 916 14239 68932 11
- YL2X 0.1,10 36.598 13813 3272 13866 5647 11
-
-
- INTEL 8087, 80287
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 N/A N/A N/A N/A N/A N/A
- COS 0,pi/4 N/A N/A N/A N/A N/A N/A
- TAN 0,pi/4 37.001 18756 524 17405 316 2
- ATAN 0,1 9.666 6065 0 3601 0 1
- 2XM1 0,0.5 19.920 0 0 19920 0 1
- YL2XP1 0,sqrt(2)-1 7.780 868 0 6912 0 1
- YL2X 0.1,10 1.287 723 0 564 0 1
-
-
- INTEL 80387
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 28.872 2467 0 26392 13 2
- COS 0,pi/4 27.213 27169 35 9 0 2
- TAN 0,pi/4 10.532 441 0 10091 0 1
- ATAN 0,1 7.088 2386 0 4691 1 2
- 2XM1 0,0.5 32.024 0 0 32024 0 1
- YL2XP1 0,sqrt(2)-1 22.611 3461 0 19150 0 1
- YL2X 0.1,10 13.020 6508 0 6512 0 1
-
-
- INTEL 387DX
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 28.873 2467 0 26393 13 2
- COS 0,pi/4 27.121 27090 22 9 0 2
- TAN 0,pi/4 10.711 457 0 10254 0 1
- ATAN 0,1 7.088 2386 0 4691 1 2
- 2XM1 0,0.5 32.024 0 0 32024 0 1
- YL2XP1 0,sqrt(2)-1 22.611 3461 0 19150 0 1
- YL2X 0.1,10 13.020 6508 0 6512 0 1
-
-
- ULSI 83C87
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 35.530 4989 6 30238 297 2
- COS 0,pi/4 43.989 11193 675 31393 728 2
- TAN 0,pi/4 48.539 18880 1015 26349 2295 3
- ATAN 0,1 20.858 62 0 20796 0 1
- 2XM1 0,0.5 21.257 4 0 21253 0 1
- YL2XP1 0,sqrt(2)-1 27.893 9446 0 18213 234 2
- YL2X 0.1,10 13.603 9816 0 3787 0 1
-
-
- ULSI DX/DLC
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 42.691 1707 0 39972 1012 2
- COS 0,pi/4 43.989 11193 675 31393 728 2
- TAN 0,pi/4 48.479 18585 999 26565 2330 3
- ATAN 0,1 20.858 62 0 20796 0 1
- 2XM1 0,0.5 21.257 4 0 21253 0 1
- YL2XP1 0,sqrt(2)-1 27.893 9446 0 18213 234 2
- YL2X 0.1,10 13.603 9816 0 3787 0 1
-
-
- IIT 3C87
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 18.650 11171 0 7479 0 1
- COS 0,pi/4 7.700 3024 0 4676 0 1
- TAN 0,pi/4 20.973 9681 0 11291 1 2
- ATAN 0,1 19.280 13186 0 6094 0 1
- 2XM1 0,0.5 25.660 17570 0 8090 0 1
- YL2XP1 0,sqrt(2)-1 45.830 23503 1896 19654 777 3
- YL2X 0.1,10 10.888 5638 357 4845 48 3
-
-
- C&T 38700DX
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 1.821 1272 0 549 0 1
- COS 0,pi/4 23.358 12458 0 10901 0 1
- TAN 0,pi/4 17.178 10725 0 6453 0 1
- ATAN 0,1 9.359 7082 0 2277 0 1
- 2XM1 0,0.5 15.188 3039 0 12149 0 1
- YL2XP1 0,sqrt(2)-1 19.497 12109 0 7388 0 1
- YL2X 0.1,10 46.868 261 0 46607 0 1
-
-
- CYRIX 83D87
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 1.554 1015 0 539 0 1
- COS 0,pi/4 0.925 143 0 782 0 1
- TAN 0,pi/4 4.147 881 0 3266 0 1
- ATAN 0,1 0.656 229 0 427 0 1
- 2XM1 0,0.5 2.628 1433 0 1194 0 1
- YL2XP1 0,sqrt(2)-1 3.242 825 0 2417 0 1
- YL2X 0.1,10 0.931 256 0 675 0 1
-
- CYRIX 387+
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 1.486 864 0 622 0 1
- COS 0,pi/4 2.072 12 0 2060 0 1
- TAN 0,pi/4 0.602 63 0 539 0 1
- ATAN 0,1 0.384 12 0 372 0 1
- 2XM1 0,0.5 1.985 27 0 1958 0 1
- YL2XP1 0,sqrt(2)-1 3.662 1705 0 1957 0 1
- YL2X 0.1,10 0.764 367 0 397 0 1
-
-
- INTEL RapidCAD, Intel 486
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 16.991 1517 0 15474 0 1
- COS 0,pi/4 9.003 7603 0 1400 0 1
- TAN 0,pi/4 10.532 441 0 10091 0 1
- ATAN 0,1 7.078 2386 0 4691 1 2
- 2XM1 0,0.5 32.025 0 0 32025 0 1
- YL2XP1 0,sqrt(2)-1 21.800 533 0 21267 0 1
- YL2X 0.1,10 3.894 1879 0 2015 0 1
-
-
- INTEL Pentium
- max
- funct. interval %wrong ULP_hi ULPs_hi ULP_lo ULPs_lo ULP err
-
- SIN 0,pi/4 3.503 2937 0 567 0 1
- COS 0,pi/4 2.113 1737 0 376 0 1
- TAN 0,pi/4 5.030 2402 0 2628 0 1
- ATAN 0,1 3.088 1266 0 1822 0 1
- 2XM1 0,0.5 7.092 1014 0 6078 0 1
- YL2XP1 0,sqrt(2)-1 8.895 417 0 8478 0 1
- YL2X 0.1,10 6.784 71 0 6713 0 1
-
-
- Discussion of the transcendental function tests
- -----------------------------------------------
-
- The test results above indicate that none of the 80x87 compatibles exceeds
- Intel's stated error bound of 3 ULPs for the transcendental functions.
- However, some coprocessors are more accurate than others. Rating the
- coprocessors according to the accuracy of their transcendental functions
- gives the following list (highest accuracy first): Cyrix 387+, Cyrix 83D87,
- Intel Pentium, Intel 486, Intel RapidCAD, Intel 80287(!), C&T 38700DX,
- Intel 387DX, Intel 80387, IIT 3C87, ULSI 83C87. The tests also show that
- the problems with excessive inaccuracy of the transcendental functions in
- early versions of the IIT coprocessors with errors of up to 8 ULPs [8] have
- been corrected. (According to [56], certain problems with the FPATAN
- instruction on the IIT 3C87 occurring under the UNIX version of AutoCAD
- were corrected in June, 1990.)
-
- Considering the coprocessor emulators, the Franke387 has acceptable accuracy
- for the FSIN, FCOS, and FPATAN instructions, taking into consideration that
- according to its documentation, Franke387 uses only 64 bits of precision for
- the intermediate results, while coprocessors typically use 68 bits and more.
- However, the larger errors in the FPTAN, F2XM1, FYL2XP1, and especially the
- FYL2X operations show that the emulator doesn't use state-of-the-art
- algorithms, which ensure an error of only a very few ULPs even if no extra
- precise intermediate results are available. Microsoft's emulator, meanwhile,
- provides transcendental functions with rather good accuracy, except for the
- logarithmic operations, which contain some minor flaws.
-
-
-
-
- ======================================================
- Intel 387DX compatibility testing / The SMDIAG program
- ======================================================
-
- Chips and Technologies has included the program SMDIAG on the V1.0 diagnostic
- disk distributed with its SuperMATH 38700DX coprocessor. Its stated purpose
- is to test the compatibility of the computational results and flag settings
- returned by the C&T coprocessor with the Intel 387DX. However, the tests for
- the transcendental functions seem to have been tweaked to let the C&T 38700DX
- pass, while coprocessors like the Intel RapidCAD and the Cyrix 83D87 fail.
- Also, SMDIAG shows failure in the FSCALE test for the Intel RapidCAD, Cyrix
- 83D87, Cyrix 387+, and ULSI 83C87, even though they return the correct result
- according to Intel's documentation for the Intel 387DX (Intel's second
- generation 387), which is indeed returned by the 387DX. (SMDIAG apparently
- expects the result returned by the original Intel 80387.)
-
- Note that chip manufacturers often make quiet bug fixes, so it wouldn't be
- surprising if somebody else, using different runs of the same manufacturer's
- chip, came up with different results than the ones below. The Intel 387 alone
- seems to have been produced in four different versions that can be told apart
- by software, and Cyrix, ULSI, and IIT have manufactured at least two versions
- each of their coprocessors. (The coprocessors I tested have the following
- manufacturing dates stamped on them. Intel 387DX: 91/49, C&T 38700DX: 92/19,
- Cyrix 387+: 92/11, Intel RapidCAD: 92/05, ULSI 83C87: 91/48, ULSI DX/DLC:
- 94/15, IIT 3C87: 92/20.)
-
- Results of running the SMDIAG program on 387-compatible coprocessors
- (p = passed, f = failed)
-
- Intel Intel Intel Cyrix Cyrix IIT ULSI ULSI C&T
- Test RapidCAD 387DX 80387 387+ 83D87 3C87 83C87 DX/DLC 38700
-
- 1 (fstore) f p p p f f f f p ##,%%
- 2 (fiall) p p p p p p f f p
- 3 (faddsub) p p p p p p p p p
- 4 (faddsub_nr) p p p p f f f f p %%
- 5 (faddsub_cp) p p p p f f f f p %%
- 6 (faddsub_dn) p p p p f f f f p %%
- 7 (faddsub_up) p p p p f f f f p %%,&&
- 8 (fmul) p p p p p f f f p
- 9 (fdivn) p p p p p p p p p
- 10 (fdiv) p p p p p p f f p
- 11 (fxch) p p p p p p p p p
- 12 (fyl2x) p p p f f f f f p ++
- 13 (fyl2xp1) f p p f f f f f p ++
- 14 (fsqrt) p p p p p p p p p
- 15 (fsincos) f p p f f f f f p ++
- 16 (fptan) p p p f p f f f p ++
- 17 (fpatan) p p p f f f f f p ++
- 18 (f2xm1) p p p f f f f f p ++
- 19 (fscale) f f p f f f f f p **
- 20 (fcom1) p p p p p f f p p
- 21 (fprem) p p p p p p p p p
- 22 (misc1) p p p p p f f p p
- 23 (misc3) p p p p p p p p p
- 24 (misc4) p p p p f f p p p %%
-
- failed modules: 4 1 0 7 12 16 17 15 0
-
-
- ## the failure of the Intel RapidCAD is caused by the fact that
- it stores the value of BCD INDEFINITE differently from the
- Intel 387DX. It uses FFFFC000000000000000, while the 387DX uses
- FFFF8000000000000000. However, both encodings are valid according
- to Intel's documentation, which defines the BCD INDEFINITE as
- FFFFUUUUUUUUUUUUUUUU, where U is undefined. So failure of the
- RapidCAD to deliver the same answer as the 387DX is not an
- "error", just a very slight incompatibility.
- ** the FSCALE errors reported for the Intel 387DX, Intel RapidCAD,
- Cyrix 83D87, Cyrix 387+, ULSI 83C87, and ULSI DX/DLC are due to
- a single 'wrong' result each returned by one of the FSCALE
- computations. SMDIAG expects the result returned by the first
- generation Intel 80387 (and, of course, the C&T 38700DX). However,
- this result is wrong according to Intel's documentation and the
- behavior was corrected in the second generation Intel 387DX.
- Therefore, the Intel RapidCAD, Cyrix 83D87, Cyrix 387+, ULSI
- 83C87, and ULSI DX/DLC return the correct result compatible
- with the Intel 387DX.
- %% Failures reported for the Cyrix 83D87 are due to the fact that it
- converts pseudodenormals contained in its registers to normalized
- numbers upon storing them to memory with the FSTP TBYTE PTR
- instruction. Intel's processors store pseudodenormals without
- 'normalizing' them. This is an incompatibility, but not an error,
- because both encodings will evaluate to the same value should
- they be reused in a calculation.
- && Two of the failures reported for the Cyrix 83D87 are actual
- errors where the Cyrix 83D87 fails to deliver the correct result.
- 1) control word = 0A7F (closure=proj., round=up, precision=53bit)
- ST(0) = 0001 ABCEF9876542101
- ST(1) = 0001 800000000345FFF
- instruction: FSUBRP ST(1), ST
- result should be: 0000 2BCEF987650EC800, status word = 3A30
- 83D87 returns: 0000 3BCEF987650EC000, status word = 3830
- 2) control word = 0A7F (closure=proj., round=up, precision=53bit)
- ST(0) = 0001 ABCEF9876542101
- ST(1) = 0001 800000000000000
- instruction: FSUB ST, ST(1)
- result should be: 0000 2BCEF98765432800, status word = 3A30
- 83D87 returns: 0000 3BCEF98765432000, status word = 3830
- ++ The failures for the test of transcendental functions are caused
- by the tested coprocessor returning results that differ from the
- ones returned by the Intel 387DX. On the Cyrix 83D87, Cyrix 387+,
- and Intel RapidCAD, this is simply due to the improved accuracy
- these coprocessors provide over the Intel 387DX. The failures of
- the IIT 3C87, ULSI 83C87, and ULSI DX/DLC are mainly due to the
- lesser accuracy in the transcendental functions of these
- coprocessors, but for the IIT 3C87 an additional source of
- failures is its inability to handle extended-precision denormals.
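-
- The control word value 0A7F quoted in the && note can be decoded with a few
- bit operations. The Python sketch below assumes the usual 80x87 control
- word layout (exception masks in bits 0-5, precision control in bits 8-9,
- rounding control in bits 10-11, infinity control in bit 12); the function
- name is made up for this example:

```python
def decode_control_word(cw):
    """Split an 80x87 control word into its main fields (assumed layout)."""
    precision = {0b00: "24 bit (single)",
                 0b01: "reserved",
                 0b10: "53 bit (double)",
                 0b11: "64 bit (double extended)"}
    rounding = {0b00: "nearest", 0b01: "down", 0b10: "up", 0b11: "chop"}
    return {
        "precision": precision[(cw >> 8) & 0b11],
        "rounding": rounding[(cw >> 10) & 0b11],
        "closure": "affine" if cw & 0x1000 else "projective",
        "exception_masks": cw & 0x3F,
    }

for field, value in decode_control_word(0x0A7F).items():
    print(field, "=", value)
```

- For 0A7F this yields projective closure, rounding up, and 53 bit
- precision, in agreement with the annotation in the note.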
-
-
- Another compatibility issue that has been discussed on Usenet is the behavior
- of the math coprocessors under protected-mode operating systems. I have seen
- postings claiming that coprocessors from ULSI, IIT, and Cyrix locked up the
- machine when a protected mode operating system (several UNIX derivatives were
- also mentioned) was run on them. However, there have been reports that
- several 486-based systems have this problem as well, while others do not.
- Therefore, I think at least some of these problems are caused by poor
- motherboard design, especially wrong handling of error interrupts coming
- from the coprocessor. There could also be bugs in the exception handlers
- of the operating system.
-
- Numerous postings on the Internet seem to confirm that using a ULSI math
- coprocessor with protected mode operating systems will result in system
- lockup once tasks using the math coprocessor are run. This seems to be the
- result of a bug in the FSAVE and FRSTOR instructions in protected mode. These
- instructions are used to save and restore the math coprocessor state for the
- purpose of switching coprocessor contents between two tasks. OS/2 and Linux
- are two operating systems that have been explicitly mentioned as having
- locked up if a ULSI math coprocessor is used, but run fine with other math
- coprocessors. ULSI is supposedly aware of the problem. So far, no fixes seem
- to have been introduced in newer ULSI math coprocessors to remedy the problem.
- Therefore it seems unlikely that ULSI will eventually introduce these bug
- fixes.
-
-
-
-
-
- ==========
- References
- ==========
-
- [1] Schnurer, G.: Zahlenknacker im Vormarsch. c't 1992, Heft 4, Seiten 170-
- 186
-
- [2] Curnow, H.J.; Wichmann, B.A.: A synthetic benchmark. Computer Journal,
- Vol. 19, No. 1, 1976, pp. 43-49
-
- [3] Wichmann, B.A.: Validation code for the Whetstone benchmark. NPL Report
- DITC 107/88, National Physical Laboratory, UK, March 1988
-
- [4] Curnow, H.J.: Whither Whetstone? The Synthetic Benchmark after 15 Years.
- In: Aad van der Steen (ed.): Evaluating Supercomputers. London: Chapman
- and Hall 1990
-
- [5] Dongarra, J.J.: The Linpack Benchmark: An Explanation. In: Aad van der
- Steen (ed.): Evaluating Supercomputers. London: Chapman and Hall 1990
-
- [6] Dongarra, J.J.: Performance of Various Computers Using Standard Linear
- Equations Software. Report CS-89-85, Computer Science Department,
- University of Tennessee, March 11, 1992
-
- [7] Huth, N.: Dichtung und Wahrheit oder Datenblatt und Test. Design &
- Elektronik 1990, Heft 13, Seiten 105-110
-
- [8] Ungerer, B.: Sockelfolger. c't 1990, Heft 4, Seiten 162-163
-
- [9] Coonen, J.T.: Contributions to a Proposed Standard for Binary Floating-
- Point Arithmetic. Ph.D. thesis, University of California, Berkeley, 1984
-
- [10] IEEE: IEEE Standard for Binary Floating-Point Arithmetic. SIGPLAN
- Notices, Vol. 22, No. 2, 1985, pp. 9-25
-
- [11] IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-
- 1985. New York, NY: Institute of Electrical and Electronics Engineers
- 1985
-
- [12] FasMath 83D87 Compatibility Report. Cyrix Corporation, Nov. 1989 Order
- No. B2004
-
- [13] FasMath 83D87 Accuracy Report. Cyrix Corporation, July 1990 Order No.
- B2002
-
- [14] FasMath 83D87 Benchmark Report. Cyrix Corporation, June 1990 Order No.
- B2004
-
- [15] FasMath 83D87 User's Manual. Cyrix Corporation, June 1990 Order No.
- L2001-003
-
- [16] Brent, R.P.: A FORTRAN multiple-precision arithmetic package. ACM
- Transactions on Mathematical Software, Vol. 4, No. 1, March 1978, pp.
- 57-70
-
- [17] 387DX User's Manual, Programmer's Reference. Intel Corporation, 1989
- Order No. 231917-002
-
- [18] Volder, J.E.: The CORDIC Trigonometric Computing Technique. IRE
- Transactions on Electronic Computers, Vol. EC-8, No. 5, September 1959,
- pp. 330-334
-
- [19] Walther, J.S.: A unified algorithm for elementary functions. AFIPS
- Conference Proceedings, Vol. 38, SJCC 1971, pp. 379-385
-
- [20] Esser, R.; Kremer, F.; Schmidt, W.G.: Testrechnungen auf der IBM 3090E
- mit Vektoreinrichtung. Arbeitsbericht RRZK-8803, Regionales
- Rechenzentrum an der Universit"at zu K"oln, Februar 1988
-
- [21] McMahon, F.H.: The Livermore Fortran Kernels: A test of the numerical
- performance range. Technical Report UCRL-53745, Lawrence Livermore
- National Laboratory, USA, December 1986
-
- [22] Nave, R.: Implementation of Transcendental Functions on a Numerics
- Processor. Microprocessing and Microprogramming, Vol. 11, No. 3-4,
- March-April 1983, pp. 221-225
-
- [23] Yuen, A.K.: Intel's Floating-Point Processors. Electro/88 Conference
- Record, Boston, MA, USA, 10-12 May 1988, pp. 48/5-1 - 48/5-7
-
- [24] Stiller, A.; Ungerer, B.: Ausgerechnet. c't 1990, Heft 1, Seiten 90-92
-
- [25] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
- 1991, Seiten 214-237
-
- [26] Intel 80286 Hardware Reference Manual. Intel Corporation, 1987 Order
- No. 210760-002
-
- [27] AMD 80C287 80-bit CMOS Numeric Processor. Advanced Micro Devices, June
- 1989 Order No. 11671B/0
-
- [28] Intel RapidCAD(tm) Engineering CoProcessor Performance Brief. Intel
- Corporation, 1992
-
- [29] i486(tm) Microprocessor Performance Report. Intel Corporation, April
- 1990 Order No. 240734-001
-
- [30] Intel486(tm) DX2 Microprocessor Performance Brief. Intel Corporation,
- March 1992 Order No. 241254-001
-
- [31] Abacus 3167 Floating-Point Coprocessor Data Book. Weitek Corporation,
- July 1990 DOC No. 9030
-
- [32] WTL 4167 Floating-Point Coprocessor Data Book. Weitek Corporation, July
- 1989 DOC No. 8943
-
- [33] Abacus Software Designer's Guide. Weitek Corporation, September 1989 DOC
- No. 8967
-
- [34] Stiller, A.: Cache & Carry. c't 1992, Heft 6, Seiten 118-130
-
- [35] Stiller, A.: Cache & Carry, Teil 2. c't 1992, Heft 7, Seiten 28-34
-
- [36] Palmer, J.F.; Morse, S.P.: Die mathematischen Grundlagen der Numerik-
- Prozessoren 8087/80287. M"unchen: tewi 1985
-
- [37] 80C187 80-bit Math Coprocessor Data Sheet. Intel Corporation, September
- 1989 Order No. 270640-003
-
- [38] IIT-2C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
-
- [39] Engineering note 4x4 matrix multiply transformation. IIT, 1989
-
- [40] Tscheuschner, E.: 4 mal 4 auf einen Streich. c't 1990, Heft 3, Seiten
- 266-276
-
- [41] Goldberg, D.: Computer Arithmetic. In: Hennessy, J.L.; Patterson, D.A.:
- Computer Architecture A Quantitative Approach. San Mateo, CA: Morgan
- Kaufmann 1990
-
- [42] 8087 Math Coprocessor Data Sheet. Intel Corporation, October 1989, Order
- No. 205835-007
-
- [43] 8086/8088 User's Manual, Programmer's and Hardware Reference. Intel
- Corporation, 1989 Order No. 240487-001
-
- [44] 80286 and 80287 Programmer's Reference Manual. Intel Corporation, 1987
- Order No. 210498-005
-
- [45] 80287XL/XLT CHMOS III Math Coprocessor Data Sheet. Intel Corporation,
- May 1990 Order No. 290376-001
-
- [46] Cyrix FasMath(tm) 82S87 Coprocessor Data Sheet. Cyrix Corporation, 1991
- Document 94018-00 Rev. 1.0
-
- [47] IIT-3C87 80-bit Numeric Co-Processor Data Sheet. IIT, May 1990
-
- [48] 486(tm)SX(tm) Microprocessor/ 487(tm)SX(tm) Math CoProcessor Data Sheet.
- Intel Corporation, April 1991. Order No. 240950-001
-
- [49] Schnurer, G.: Die gro"se Verlade. c't 1991, Heft 7, Seiten 55-57
-
- [50] Schnurer, G.: Eine 4 f"ur alle. c't 1991, Heft 6, Seite 25
-
- [51] Intel486(tm)DX Microprocessor Data Book. Intel Corporation, June 1991
- Order No. 240440-004
-
- [52] i486(tm) Microprocessor Hardware Reference Manual. Intel Corporation,
- 1990 Order No. 240552-001
-
- [53] i486(tm) Microprocessor Programmer's Reference Manual. Intel
- Corporation, 1990 Order No. 240486-001
-
- [54] Ungerer, B.: Kalte H"ute. c't 1992, Heft 8, Seiten 140-144
-
- [55] Ungerer, B.: Hei"se Sache. c't 1991, Heft 4, Seiten 104-108
-
- [56] Rosch, W.L.: Handfeste Hilfe oder Seifenblase? PC Professionell, Juni
- 1991, Seiten 214-237
-
- [57] Niederkr"uger, W.: Lebendige Vergangenheit. c't 1990, Heft 12, Seiten
- 114-116
-
- [58] ULSI Math*Co Advanced Math Coprocessor Technical Specification. ULSI
- System, 5/92, Rev. E
-
- [59] 387(tm)DX Math CoProcessor Data Sheet. Intel Corporation, September
- 1990. Order No. 240448-003
-
- [60] 387(tm) Numerics Coprocessor Extension Data Sheet. Intel Corporation,
- February 1989. Order No. 231920-005
-
- [61] Koren, I.; Zinaty, O.: Evaluating Elementary Functions in a Numerical
- Coprocessor Based on Rational Approximations. IEEE Transactions on
- Computers, Vol. C-39, No. 8, August 1990, pp. 1030-1037
-
- [62] 387(tm) SX Math CoProcessor Data Sheet. Intel Corporation, November 1989
- Order No. 240225-005
-
- [63] Frenkel, G.: Coprocessors Speed Numeric Operations. PC-Week, August 27,
- 1990
-
- [64] Schnurer, G.; Stiller, A.: Auto-Matt. c't 1991, Heft 10, Seiten 94-96
-
- [65] Grehan, R.: FPU Face-Off. Byte, November 1990, pp. 194-200
-
- [66] Tang, P.T.P.: Testing Computer Arithmetic by Elementary Number Theory.
- Preprint MCS-P84-0889, Mathematics and Computer Science Division,
- Argonne National Laboratory, August 1989
-
- [67] Ferguson, W.E.: Selecting math coprocessors. IEEE Spectrum, July 1991,
- pp. 38-41
-
- [68] Schnabel, J.: Viermal 387. Computer Pers"onlich 1991, Heft 22, Seiten
- 153-156
-
- [69] Hofmann, J.: Starke Rechenknechte. mc 1990, Heft 7, Seiten 64-67
-
- [70] Woerrlein, H.; Hinnenberg, R.: Die Lust an der Power. Computer Live
- 1991, Heft 10, Seiten 138-149
-
- [71] email from Peter Forsberg (peterf@vnet.ibm.com), email from Alan Brown
- (abrown@Reston.ICL.COM)
-
- [72] email from Eric Johnson (johnsone%camax01@uunet.UU.NET), email from
- Jerry Whelan (guru@stasi.bradley.edu), email from Arto Viitanen
- (av@cs.uta.fi), email from Richard Krehbiel (richk@grebyn.com)
-
- [73] email from Fred Dunlap (cyrix!fred@texsun.Central.Sun.COM)
-
- [74] correspondence with Bengt Ask (f89ba@efd.lth.se)
-
- [75] email from Thomas Hoberg (tmh@prosun.first.gmd.de)
-
- [76] Microsoft Macro Assembler Programmer's Guide Version 6.0, Microsoft
- Corporation, 1991. Document No. LN06556-0291
-
- [77] FasMath EMC87 User's Manual, Rev. 2. Cyrix Corporation, February 1991
- Order No. 90018-00
-
- [78] Persson, C.: Die 32-Bit-Parade. c't 1992, Heft 9, Seiten 150-156
-
- [79] email from Duncan Murdoch (dmurdoch@mast.QueensU.CA)
-
- [80] FasMath 83S87 User's Manual. Cyrix Corporation, January 1990
- Order No. L2005-002
-
-
-
- ========================
- Manufacturers' addresses
- ========================
-
- Intel Corporation
- 2200 Mission College Blvd.
- Santa Clara, CA 95054
- USA
-
- IIT Integrated Information Technology, Inc.
- 2540 Mission College Blvd.
- Santa Clara, CA 95054
- USA
-
- ULSI Systems, Inc.
- 58 Daggett Drive
- San Jose, CA 95134
- USA
-
- Chips & Technologies, Inc.
- 3050 Zanker Road
- San Jose, CA 95134
- USA
-
- Weitek Corporation
- 1060 East Arques Avenue
- Sunnyvale, CA 94086
- USA
-
- AMD Advanced Micro Devices, Inc.
- 901 Thompson Place
- P.O.B. 3453
- Sunnyvale, CA 94088-3453
- USA
-
- Cyrix Corporation
- P.O.B. 850118
- Richardson, TX 75085
- USA
-
-
-
- ===============================
- Appendix A: Test program source
- ===============================
-
- {$N+,E+}
- PROGRAM PCtrl;
-
- VAR B: EXTENDED;
-     Precision, L: WORD;
-
- PROCEDURE SetPrecisionControl (Precision: WORD);
- (* This procedure sets the internal precision of the NDP. Available *)
- (* precision values: 0 - 24 bits (SINGLE)                           *)
- (*                   1 - n.a. (mapped to single)                    *)
- (*                   2 - 53 bits (DOUBLE)                           *)
- (*                   3 - 64 bits (EXTENDED)                         *)
-
- VAR CtrlWord: WORD;
-
- BEGIN {SetPrecisionControl}
-   IF Precision = 1 THEN
-     Precision := 0;
-   Precision := Precision SHL 8;  { make mask for PC field in ctrl word }
-   ASM
-     FSTCW [CtrlWord]             { store NDP control word }
-     MOV AX, [CtrlWord]           { load control word into CPU }
-     AND AX, 0FCFFh               { mask out precision control field }
-     OR AX, [Precision]           { set desired precision in PC field }
-     MOV [CtrlWord], AX           { store new control word }
-     FLDCW [CtrlWord]             { set new precision control in NDP }
-   END;
- END; {SetPrecisionControl}
-
- BEGIN {main}
-   FOR Precision := 1 TO 3 DO BEGIN
-     B := 1.2345678901234567890;
-     SetPrecisionControl (Precision);
-     FOR L := 1 TO 20 DO BEGIN
-       B := Sqrt (B);
-     END;
-     FOR L := 1 TO 20 DO BEGIN
-       B := B*B;
-     END;
-     SetPrecisionControl (3);     { full precision for printout }
-     WriteLn (Precision, B:28);
-   END;
- END.
-
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$N+,E+}
- PROGRAM RCtrl;
-
- VAR B: EXTENDED;
-     RoundingMode, L: WORD;
-
-
- PROCEDURE SetRoundingMode (RCMode: WORD);
- (* This procedure selects one of four available rounding modes *)
- (* 0 - Round to nearest (default)                              *)
- (* 1 - Round down (towards negative infinity)                  *)
- (* 2 - Round up (towards positive infinity)                    *)
- (* 3 - Chop (truncate, round towards zero)                     *)
-
- VAR CtrlWord: WORD;
-
- BEGIN {SetRoundingMode}
-   RCMode := RCMode SHL 10;       { make mask for RC field in control word }
-   ASM
-     FSTCW [CtrlWord]             { store NDP control word }
-     MOV AX, [CtrlWord]           { load control word into CPU }
-     AND AX, 0F3FFh               { mask out rounding control field }
-     OR AX, [RCMode]              { set desired rounding mode in RC field }
-     MOV [CtrlWord], AX           { store new control word }
-     FLDCW [CtrlWord]             { set new rounding control in NDP }
-   END;
- END; {SetRoundingMode}
-
- BEGIN {main}
-   FOR RoundingMode := 0 TO 3 DO BEGIN
-     B := 1.2345678901234567890e100;
-     SetRoundingMode (RoundingMode);
-     FOR L := 1 TO 51 DO BEGIN
-       B := Sqrt (B);
-     END;
-     FOR L := 1 TO 51 DO BEGIN
-       B := -B*B;
-     END;
-     SetRoundingMode (0);         { round to nearest for printout }
-     WriteLn (RoundingMode, B:28);
-   END;
- END.
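
The two programs above modify the PC field (bits 8-9, mask 0FCFFh) and the RC
field (bits 10-11, mask 0F3FFh) of the coprocessor control word. The following
small sketch reads the control word and prints both fields, which can help to
check what a run-time library or another program has set up. It assumes Turbo
Pascal 6.0 or later with the built-in assembler. The program name ShowCtrl is
made up for this example.

```pascal
{$N+,E+}
PROGRAM ShowCtrl;
{ Hypothetical helper: decode the precision control and rounding  }
{ control fields of the current NDP control word.                 }

VAR CtrlWord: WORD;

BEGIN
  ASM
    FSTCW [CtrlWord]        { store NDP control word               }
    FWAIT                   { make sure the store has completed    }
  END;
  { field values use the same encoding as in PCtrl and RCtrl }
  WriteLn ('Precision control (bits 8-9):  ', (CtrlWord SHR 8) AND 3);
  WriteLn ('Rounding control (bits 10-11): ', (CtrlWord SHR 10) AND 3);
END.
```

The printed values use the same encoding as the parameters of
SetPrecisionControl and SetRoundingMode.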
-
-
- +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$N+,E+}
-
- PROGRAM DenormTs;
-
- VAR E: EXTENDED;
-     D: DOUBLE;
-     S: SINGLE;
-
- BEGIN
-   WriteLn ('Testing support and printing of denormals');
-   WriteLn;
-   Write ('Coprocessor is: ');
-   CASE Test8087 OF
-     0: WriteLn ('Emulator');
-     1: WriteLn ('8087 or compatible');
-     2: WriteLn ('80287 or compatible');
-     3: WriteLn ('80387 or compatible');
-   END;
-   WriteLn;
-   S := 1.18e-38;
-   S := S * 3.90625e-3;
-   IF S = 0 THEN
-     WriteLn ('SINGLE denormals not supported')
-   ELSE BEGIN
-     WriteLn ('SINGLE denormals supported');
-     WriteLn ('SINGLE denormal prints as: ', S);
-     WriteLn ('Denormal should be printed as 4.60943...E-0041');
-   END;
-   WriteLn;
-   D := 2.24e-308;
-   D := D * 3.90625e-3;
-   IF D = 0 THEN
-     WriteLn ('DOUBLE denormals not supported')
-   ELSE BEGIN
-     WriteLn ('DOUBLE denormals supported');
-     WriteLn ('DOUBLE denormal prints as: ', D);
-     WriteLn ('Denormal should be printed as 8.75...E-0311');
-   END;
-   WriteLn;
-   E := 3.37e-4932;
-   E := E * 3.90625e-3;
-   IF E = 0 THEN
-     WriteLn ('EXTENDED denormals not supported')
-   ELSE BEGIN
-     WriteLn ('EXTENDED denormals supported');
-     WriteLn ('EXTENDED denormal prints as: ', E);
-     WriteLn ('Denormal should be printed as 1.3164...E-4934');
-   END;
- END.
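
The starting values in DenormTs are chosen just above the smallest normalized
number of each format, and the factor 3.90625e-3 is 1/256, so each product
lands in the denormal range. The following sketch shows where the SINGLE and
DOUBLE thresholds come from by computing 2^-126 and 2^-1022 in EXTENDED
arithmetic (about 1.1755E-38 and 2.2251E-308). The program name MinNorm is
made up for this example.

```pascal
{$N+,E+}
PROGRAM MinNorm;
{ Hypothetical companion to DenormTs: prints the smallest         }
{ normalized SINGLE and DOUBLE values and the scale factor used.  }

VAR E: EXTENDED;
    I: INTEGER;

BEGIN
  E := 1.0;
  FOR I := 1 TO 126 DO            { 126 halvings give 2^-126 }
    E := E / 2.0;
  WriteLn ('Smallest normalized SINGLE, 2^-126:  ', E);
  FOR I := 127 TO 1022 DO         { 896 more halvings give 2^-1022 }
    E := E / 2.0;
  WriteLn ('Smallest normalized DOUBLE, 2^-1022: ', E);
  WriteLn ('Scale factor used by DenormTs, 2^-8: ', 1.0 / 256.0);
END.
```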
-
-
-
- ====================================
- Appendix B: Benchmark program source
- ====================================
-
-
- ; FILE: APFELM4.ASM
- ; assemble with MASM /e APFELM4 or TASM /e APFELM4
-
-
- CODE SEGMENT BYTE PUBLIC 'CODE'
- ASSUME CS: CODE
-
- PAGE ,120
-
- PUBLIC APPLE87;
-
- APPLE87 PROC NEAR
- PUSH BP ; save caller's base pointer
- MOV BP, SP ; make new frame pointer
- PUSH DS ; save caller's data segment
- PUSH SI ; save register
- PUSH DI ; variables
- LDS BX, [BP+04] ; pointer to parameter record
- FINIT ; init 80x87 FSP->R0
- FILD WORD PTR [BX+02] ; maxrad FSP->R7
- FLD QWORD PTR [BX+08] ; qmax FSP->R6
- FSUB QWORD PTR [BX+16] ; qmax-qmin FSP->R6
- DEC WORD PTR [BX+04] ; ymax-1
- FIDIV WORD PTR [BX+04] ; (qmax-qmin)/(ymax-1) FSP->R6
- FSTP QWORD PTR [BX+16] ; save delta_q FSP->R7
- FLD QWORD PTR [BX+24] ; pmax FSP->R6
- FSUB QWORD PTR [BX+32] ; pmax-pmin FSP->R6
- DEC WORD PTR [BX+06] ; xmax-1
- FIDIV WORD PTR [BX+06] ; delta_p FSP->R6
- MOV AX, [BX] ; save maxiter,[BX] needed for
- MOV [BX+2], AX ; 80x87 status now
- XOR BP, BP ; y=0
- FLD QWORD PTR [BX+08] ; qmax FSP->R5
- CMP WORD PTR [BX+40], 0 ; fast mode on 8087 desired ?
- JE yloop ; no, normal mode
- FSTCW [BX] ; save NDP control word
- AND WORD PTR [BX], 0FCFFh ; set PCTRL = single-precision
- FLDCW [BX] ; get back NDP control word
- yloop: XOR DI, DI ; x=0
- FLD QWORD PTR [BX+32] ; pmin FSP->R4
- xloop: FLDZ ; j**2= 0 FSP->R3
- FLDZ ; 2ij = 0 FSP->R2
- FLDZ ; i**2= 0 FSP->R1
- MOV CX, [BX+2] ; maxiter
- MOV DL, 41h ; mask for C0 and C3 cond.bits
- iteration: FSUB ST, ST(2) ; i**2-j**2 FSP->R1
- FADD ST, ST(3) ; i**2-j**2+p = i FSP->R1
- FLD ST(0) ; duplicate i FSP->R0
- FMUL ST(1), ST ; i**2 FSP->R0
- FADD ST, ST(0) ; 2i FSP->R0
- FXCH ST(2) ; 2*i*j FSP->R0
- FADD ST, ST(5) ; 2*i*j+q = j FSP->R0
- FMUL ST(2), ST ; 2*i*j FSP->R0
- FMUL ST, ST(0) ; j**2 FSP->R0
- FST ST(3) ; save j**2 FSP->R0
- FADD ST, ST(1) ; i**2+j**2 FSP->R0
- FCOMP ST(7) ; i**2+j**2 > maxrad? FSP->R1
- FSTSW [BX] ; save 80x87 cond.code FSP->R1
- TEST BYTE PTR [BX+1], DL ; test carry and zero flags
- LOOPNZ iteration ; until maxiter if not diverg.
- MOV DX, CX ; number of loops executed
- NEG CX ; carry set if CX <> 0
- ADC DX, 0 ; adjust DX if no. of loops<>0
-
- ; plot point here (DI = X, BP = y, DX has the color)
-
- FSTP ST(0) ; pop i**2 FSP->R2
- FSTP ST(0) ; pop 2ij FSP->R3
- FSTP ST(0) ; pop j**2 FSP->R4
- FADD ST,ST(2) ; p=p+delta_p FSP->R4
- INC DI ; x:=x+1
- CMP DI, [BX+6] ; x > xmax ?
- JBE xloop ; no, continue on same line
- FSTP ST(0) ; pop p FSP->R5
- FSUB QWORD PTR [BX+16] ; q=q-delta_q FSP->R5
- INC BP ; y:=y+1
- CMP BP, [BX+4] ; y > ymax ?
- JBE yloop ; no, picture not done yet
-
- groesser: POP DI ; restore
- POP SI ; register variables
- POP DS ; restore caller's data segm.
- POP BP ; restore caller's base pointer
- RET 4 ; pop parameters and return
- APPLE87 ENDP
-
- CODE ENDS
-
- END
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- UNIT Time;
-
- INTERFACE
-
- FUNCTION Clock: LONGINT; { same as VMS; time in milliseconds }
-
-
- IMPLEMENTATION
-
- FUNCTION Clock: LONGINT; ASSEMBLER;
- ASM
- PUSH DS { save caller's data segment }
- XOR DX, DX { initialize data segment to }
- MOV DS, DX { access ticker counter }
- MOV BX, 46Ch { offset of ticker counter in segm.}
- MOV DX, 43h { timer chip control port }
- MOV AL, 4 { freeze timer 0 }
- PUSHF { save caller's int flag setting }
- STI { allow update of ticker counter }
- LES DI, DS:[BX] { read BIOS ticker counter }
- OUT DX, AL { latch timer 0 }
- LDS SI, DS:[BX] { read BIOS ticker counter }
- IN AL, 40h { read latched timer 0 lo-byte }
- MOV AH, AL { save lo-byte }
- IN AL, 40h { read latched timer 0 hi-byte }
- POPF { restore caller's int flag }
- XCHG AL, AH { correct order of hi and lo }
- MOV CX, ES { ticker counter 1 in CX:DI:AX }
- CMP DI, SI { ticker counter updated ? }
- JE @no_update { no }
- OR AX, AX { update before timer freeze ? }
- JNS @no_update { no }
- MOV DI, SI { use second }
- MOV CX, DS { ticker counter }
- @no_update:NOT AX { counter counts down }
- MOV BX, 36EDh { load multiplier }
- MUL BX { W1 * M }
- MOV SI, DX { save W1 * M (hi) }
- MOV AX, BX { get M }
- MUL DI { W2 * M }
- XCHG BX, AX { AX = M, BX = W2 * M (lo) }
- MOV DI, DX { DI = W2 * M (hi) }
- ADD BX, SI { accumulate }
- ADC DI, 0 { result }
- XOR SI, SI { load zero }
- MUL CX { W3 * M }
- ADD AX, DI { accumulate }
- ADC DX, SI { result in DX:AX:BX }
- MOV DH, DL { move result }
- MOV DL, AH { from DL:AX:BX }
- MOV AH, AL { to }
- MOV AL, BH { DX:AX:BH }
- MOV DI, DX { save result }
- MOV CX, AX { in DI:CX }
- MOV AX, 25110 { calculate correction }
- MUL DX { factor }
- SUB CX, DX { subtract correction }
- SBB DI, SI { factor }
- XCHG AX, CX { result back }
- MOV DX, DI { to DX:AX }
- POP DS { restore caller's data segment }
- END;
-
-
- BEGIN
- Port [$43] := $34; { need rate generator, not square wave}
- Port [$40] := 0; { generator as prog. by some BIOSes }
- Port [$40] := 0; { for timer 0 }
- END. { Time }
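
The scaling in Clock can be read as follows: the combined tick count is
multiplied by 36EDh (14061) and shifted right by 24 bits, and a small
correction based on the constant 25110 is then subtracted. The sketch below
repeats that calculation in floating-point arithmetic and compares it with a
direct division by the timer input rate of about 1193.18167 ticks per
millisecond (14.31818 MHz / 12). It is an idealized check under these
assumptions, not a bit-exact model of the assembly code. The program name
TickToMs is made up for this example.

```pascal
{$N+,E+}
PROGRAM TickToMs;
{ Hypothetical check of the fixed-point scaling used in Clock.    }

CONST Ticks = 1193182.0;          { roughly one second of timer ticks }

VAR Approx, Exact: EXTENDED;

BEGIN
  Approx := Ticks * 14061.0 / 16777216.0;              { * 36EDh / 2^24 }
  Approx := Approx - Approx * 25110.0 / 4294967296.0;  { correction     }
  Exact  := Ticks / 1193.18167;                        { ticks per ms   }
  WriteLn ('Fixed-point scaling:    ', Approx:12:4, ' ms');
  WriteLn ('Division by tick rate:  ', Exact:12:4, ' ms');
END.
```

Both results should be very close to 1000 ms.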
-
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$A+,B-,R-,I-,V-,N+,E+}
- PROGRAM PeakFlop;
-
- USES Time;
-
- TYPE ParamRec = RECORD
-        MaxIter, MaxRad, YMax, XMax: WORD;
-        Qmax, Qmin, Pmax, Pmin: DOUBLE;
-        FastMod: WORD;
-        PlotFkt: POINTER;
-        FLOPS: LONGINT;
-      END;
-
- VAR Param: ParamRec;
-     Start: LONGINT;
-
-
- {$L APFELM4.OBJ}
-
- PROCEDURE Apple87 (VAR Param: ParamRec); EXTERNAL;
-
-
- BEGIN
-   WITH Param DO BEGIN
-     MaxIter:= 50;
-     MaxRad := 30;
-     YMax   := 30;
-     XMax   := 30;
-     Pmin   :=-2.1;
-     Pmax   := 1.1;
-     Qmin   :=-1.2;
-     Qmax   := 1.2;
-     FastMod:= Word (FALSE);
-     PlotFkt:= NIL;
-     Flops  := 0;
-   END;
-   Start := Clock;
-   Apple87 (Param);                { executes 104002 FLOP }
-   Start := Clock - Start;         { elapsed time in milliseconds }
-   WriteLn ('Peak-MFLOPS: ', 104.002 / Start);
- END.
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- ; FILE: M4X4.ASM
- ;
- ; assemble with TASM /e M4X4 or MASM /e M4X4
-
- CODE SEGMENT BYTE PUBLIC 'CODE'
-
- ASSUME CS:CODE
-
- PUBLIC MUL_4x4
- PUBLIC IIT_MUL_4x4
-
-
- FSBP0 EQU DB 0DBh, 0E8h ; declare special IIT
- FSBP1 EQU DB 0DBh, 0EBh ; instructions
- FSBP2 EQU DB 0DBh, 0EAh
- F4X4 EQU DB 0DBh, 0F1h
-
-
- ;---------------------------------------------------------------------
- ;
- ; MUL_4x4 multiplies a four-by-four matrix by an array of
- ; four-dimensional vectors. This operation is needed for 3D
- ; transformations in graphics data processing. There is one array for
- ; each component of a vector: one array contains all the x components,
- ; another contains all the y components, and so on. Each component is
- ; an 8-byte IEEE floating-point number. Two indices into the array of
- ; vectors are given. The first is the index of the vector that will be
- ; processed first, the second is the index of the vector processed
- ; last.
- ;
- ;---------------------------------------------------------------------
-
- MUL_4x4 PROC NEAR
-
- AddrX EQU DWORD PTR [BP+24] ; address of X component array
- AddrY EQU DWORD PTR [BP+20] ; address of Y component array
- AddrZ EQU DWORD PTR [BP+16] ; address of Z component array
- AddrW EQU DWORD PTR [BP+12] ; address of W component array
- AddrT EQU DWORD PTR [BP+8] ; addr. of 4x4 transform. mat.
- F EQU WORD PTR [BP+6] ; first vector to process
- K EQU WORD PTR [BP+4] ; last vector to process
- RetAddr EQU WORD PTR [BP+2] ; return address saved by call
- SavdBP EQU WORD PTR [BP+0] ; saved frame pointer
- SavdDS EQU WORD PTR [BP-2] ; caller's data segment
-
- PUSH BP ; save TURBO-Pascal frame ptr
- MOV BP, SP ; new frame pointer
- PUSH DS ; save TURBO-Pascal data segmnt
-
- MOV CX, K ; final index
- SUB CX, F ; final index - start index
- JNC $ok ; must not
- JMP $nothing ; be negative
- $ok: INC CX ; number of elements
-
- MOV SI, F ; init offset into arrays
- SHL SI, 1 ; each
- SHL SI, 1 ; element
- SHL SI, 1 ; has 8 bytes
-
- LDS DI, AddrT ; addr. of transformation mat.
- FLD QWORD PTR [DI] ; load a[0,0] = R7
- FLD QWORD PTR [DI+8] ; load a[0,1] = R6
-
- $mat_mul: LES BX, AddrX ; addr. of x component array
- FLD QWORD PTR ES:[BX+SI] ; load x[a] = R5
- LES BX, AddrY ; addr. of y component array
- FLD QWORD PTR ES:[BX+SI] ; load y[a] = R4
- LES BX, AddrZ ; addr. of z component array
- FLD QWORD PTR ES:[BX+SI] ; load z[a] = R3
- LES BX, AddrW ; addr. of w component array
- FLD QWORD PTR ES:[BX+SI] ; load w[a] = R2
-
- FLD ST(5) ; load a[0,0] = R1
- FMUL ST, ST(4) ; a[0,0] * x[a] = R1
- FLD ST(5) ; load a[0,1] = R0
- FMUL ST, ST(4) ; a[0,1] * y[a] = R0
- FADDP ST(1), ST ; a[0,0]*x[a]+a[0,1]*y[a]=R1
- FLD QWORD PTR [DI+16] ; load a[0,2] = R0
- FMUL ST, ST(3) ; a[0,2] * z[a] = R0
- FADDP ST(1), ST ; a[0,0]*x[a]...a[0,2]*z[a]=R1
- FLD QWORD PTR [DI+24] ; load a[0,3] = R0
- FMUL ST, ST(2) ; a[0,3] * w[a] = R0
- FADDP ST(1), ST ; a[0,0]*x[a]...a[0,3]*w[a]=R1
- LES BX, AddrX ; get address of x vector
- FSTP QWORD PTR ES:[BX+SI] ; write new x[a]
-
- FLD QWORD PTR [DI+32] ; load a[1,0] = R1
- FMUL ST, ST(4) ; a[1,0] * x[a] = R1
- FLD QWORD PTR [DI+40] ; load a[1,1] = R0
- FMUL ST, ST(4) ; a[1,1] * y[a] = R0
- FADDP ST(1), ST ; a[1,0]*x[a]+a[1,1]*y[a]=R1
- FLD QWORD PTR [DI+48] ; load a[1,2] = R0
- FMUL ST, ST(3) ; a[1,2] * z[a] = R0
- FADDP ST(1), ST ; a[1,0]*x[a]...a[1,2]*z[a]=R1
- FLD QWORD PTR [DI+56] ; load a[1,3] = R0
- FMUL ST, ST(2) ; a[1,3] * w[a] = R0
- FADDP ST(1), ST ; a[1,0]*x[a]...a[1,3]*w[a]=R1
- LES BX, AddrY ; get address of y vector
- FSTP QWORD PTR ES:[BX+SI] ; write new y[a]
-
- FLD QWORD PTR [DI+64] ; load a[2,0] = R1
- FMUL ST, ST(4) ; a[2,0] * x[a] = R1
- FLD QWORD PTR [DI+72] ; load a[2,1] = R0
- FMUL ST, ST(4) ; a[2,1] * y[a] = R0
- FADDP ST(1), ST ; a[2,0]*x[a]+a[2,1]*y[a]=R1
- FLD QWORD PTR [DI+80] ; load a[2,2] = R0
- FMUL ST, ST(3) ; a[2,2] * z[a] = R0
- FADDP ST(1), ST ; a[2,0]*x[a]...a[2,2]*z[a]=R1
- FLD QWORD PTR [DI+88] ; load a[2,3] = R0
- FMUL ST, ST(2) ; a[2,3] * w[a] = R0
- FADDP ST(1), ST ; a[2,0]*x[a]...a[2,3]*w[a]=R1
- LES BX, AddrZ ; get address of z vector
- FSTP QWORD PTR ES:[BX+SI] ; write new z[a]
-
- FLD QWORD PTR [DI+96] ; load a[3,0] = R1
- FMULP ST(4), ST ; a[3,0] * x[a] = R5
- FLD QWORD PTR [DI+104] ; load a[3,1] = R1
- FMULP ST(3), ST ; a[3,1] * y[a] = R4
- FLD QWORD PTR [DI+112] ; load a[3,2] = R1
- FMULP ST(2), ST ; a[3,2] * z[a] = R3
- FLD QWORD PTR [DI+120] ; load a[3,3] = R1
- FMULP ST(1), ST ; a[3,3] * w[a] = R2
- FADDP ST(1), ST ; a[3,3]*w[a]+a[3,2]*z[a]=R3
- FADDP ST(1), ST ; a[3,3]*w[a]...a[3,1]*y[a]=R4
- FADDP ST(1), ST ; a[3,3]*w[a]...a[3,0]*x[a]=R5
- LES BX, AddrW ; get address of w vector
- FSTP QWORD PTR ES:[BX+SI] ; write new w[a]
-
- ADD SI, 8 ; new offset into arrays
- DEC CX ; decrement element counter
- JZ $done ; no elements left, done
- JMP $mat_mul ; transform next vector
-
- $done: FSTP ST(0) ; clear
- FSTP ST(0) ; FPU stack
- $nothing: POP DS ; restore TP data segment
- POP BP ; restore TP frame pointer
- RET 24 ; pop parameters and return
-
- MUL_4X4 ENDP
-
-
- ;---------------------------------------------------------------------
- ;
- ; IIT_MUL_4x4 multiplies a four-by-four matrix by an array of
- ; four-dimensional vectors. This operation is needed for 3D
- ; transformations in graphics data processing. There is one array for
- ; each component of a vector: one array contains all the x components,
- ; another contains all the y components, and so on. Each component is
- ; an 8-byte IEEE floating-point number. Two indices into the array of
- ; vectors are given. The first is the index of the vector that will be
- ; processed first, the second is the index of the vector processed
- ; last. This subroutine uses the special instructions only available
- ; on IIT coprocessors to provide fast matrix multiply capabilities,
- ; so make sure to use it only on IIT coprocessors.
- ;
- ;---------------------------------------------------------------------
-
- IIT_MUL_4x4 PROC NEAR
-
- AddrX EQU DWORD PTR [BP+24] ; address of X component array
- AddrY EQU DWORD PTR [BP+20] ; address of Y component array
- AddrZ EQU DWORD PTR [BP+16] ; address of Z component array
- AddrW EQU DWORD PTR [BP+12] ; address of W component array
- AddrT EQU DWORD PTR [BP+8] ; addr. of 4x4 transf. matrix
- F EQU WORD PTR [BP+6] ; first vector to process
- K EQU WORD PTR [BP+4] ; last vector to process
- RetAddr EQU WORD PTR [BP+2] ; return address saved by call
- SavdBP EQU WORD PTR [BP+0] ; saved frame pointer
- SavdDS EQU WORD PTR [BP-2] ; caller's data segment
- Ctrl87 EQU WORD PTR [BP-4] ; caller's 80x87 control word
-
- PUSH BP ; save TURBO-Pascal frame ptr
- MOV BP, SP ; new frame pointer
- PUSH DS ; save TURBO-Pascal data seg.
- SUB SP, 2 ; make local variable
- FSTCW [Ctrl87] ; save 80x87 ctrl word
- LES SI, AddrT ; ptr to transformation matrix
- FINIT ; initialize coprocessor
- FSBP2 ; set register bank 2
- FLD QWORD PTR ES:[SI] ; load a[0,0]
- FLD QWORD PTR ES:[SI+32] ; load a[1,0]
- FLD QWORD PTR ES:[SI+64] ; load a[2,0]
- FLD QWORD PTR ES:[SI+96] ; load a[3,0]
- FLD QWORD PTR ES:[SI+8] ; load a[0,1]
- FLD QWORD PTR ES:[SI+40] ; load a[1,1]
- FLD QWORD PTR ES:[SI+72] ; load a[2,1]
- FLD QWORD PTR ES:[SI+104] ; load a[3,1]
- FINIT ; initialize coprocessor
- FSBP1 ; set register bank 1
- FLD QWORD PTR ES:[SI+16] ; load a[0,2]
- FLD QWORD PTR ES:[SI+48] ; load a[1,2]
- FLD QWORD PTR ES:[SI+80] ; load a[2,2]
- FLD QWORD PTR ES:[SI+112] ; load a[3,2]
- FLD QWORD PTR ES:[SI+24] ; load a[0,3]
- FLD QWORD PTR ES:[SI+56] ; load a[1,3]
- FLD QWORD PTR ES:[SI+88] ; load a[2,3]
- FLD QWORD PTR ES:[SI+120] ; load a[3,3]
-
- ; transformation matrix loaded
-
- MOV AX, F ; index of first vector
- MOV DX, K ; index of last vector
-
- MOV BX, AX ; index 1st vector to process
- MOV CL, 3 ; component has 8 (2**3) bytes
- SHL BX, CL ; compute offset into arrays
-
- FINIT ; initialize coprocessor
- FSBP0 ; set register bank 0
-
- $mat_loop:LES SI, AddrW ; addr. of W component array
- FLD QWORD PTR ES:[SI+BX] ; W component current vector
- LES SI, AddrZ ; addr. of Z component array
- FLD QWORD PTR ES:[SI+BX] ; Z component current vector
- LES SI, AddrY ; addr. of Y component array
- FLD QWORD PTR ES:[SI+BX] ; Y component current vector
- LES SI, AddrX ; addr. of X component array
- FLD QWORD PTR ES:[SI+BX] ; X component current vector
- F4X4 ; mul 4x4 matrix by 4x1 vector
- INC AX ; next vector
- MOV DI, AX ; next vector
- SHL DI, CL ; offset of vector into arrays
-
- FSTP QWORD PTR ES:[SI+BX] ; store X comp. of curr. vect.
- LES SI, AddrY ; address of Y component array
- FSTP QWORD PTR ES:[SI+BX] ; store Y comp. of curr. vect.
- LES SI, AddrZ ; address of Z component array
- FSTP QWORD PTR ES:[SI+BX] ; store Z comp. of curr. vect.
- LES SI, AddrW ; address of W component array
- FSTP QWORD PTR ES:[SI+BX] ; store W comp. of curr. vect.
-
- MOV BX, DI ; ofs nxt vect. in comp. arrays
- CMP AX, DX ; nxt vector past upper bound?
- JLE $mat_loop ; no, transform next vector
- FLDCW [Ctrl87] ; restore orig 80x87 ctrl word
-
- ADD SP, 2 ; get rid of local variable
- POP DS ; restore TP data segment
- POP BP ; restore TP frame pointer
- RET 24 ; pop parameters and return
- IIT_MUL_4x4 ENDP
-
- CODE ENDS
-
- END
-
- ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
- {$N+,E+}
-
- PROGRAM Trnsform;
-
- USES Time;
-
- CONST VectorLen = 8190;
-
- TYPE Vector    = ARRAY [0..VectorLen] OF DOUBLE;
-      VectorPtr = ^Vector;
-      Mat4      = ARRAY [1..4, 1..4] OF DOUBLE;
-
- VAR X: VectorPtr;
-     Y: VectorPtr;
-     Z: VectorPtr;
-     W: VectorPtr;
-     T: Mat4;
-     K: INTEGER;
-     L: INTEGER;
-     First: INTEGER;
-     Last: INTEGER;
-     Start: LONGINT;
-     Elapsed: LONGINT;
-
- PROCEDURE MUL_4X4 (X, Y, Z, W: VectorPtr;
-                    VAR T: Mat4; First, Last: INTEGER); EXTERNAL;
- PROCEDURE IIT_MUL_4X4 (X, Y, Z, W: VectorPtr;
-                        VAR T: Mat4; First, Last: INTEGER); EXTERNAL;
-
- {$L M4X4.OBJ}
-
- BEGIN
-   WriteLn ('Test8087 = ', Test8087);
-   New (X);
-   New (Y);
-   New (Z);
-   New (W);
-   FOR L := 1 TO VectorLen DO BEGIN
-     X^ [L] := Random;
-     Y^ [L] := Random;
-     Z^ [L] := Random;
-     W^ [L] := Random;
-   END;
-   X^ [0] := 1;
-   Y^ [0] := 1;
-   Z^ [0] := 1;
-   W^ [0] := 1;
-   FOR K := 1 TO 4 DO BEGIN
-     FOR L := 1 TO 4 DO BEGIN
-       T [K, L] := (K-1)*4 + L;
-     END;
-   END;
-   First := 0;
-   Last := VectorLen;
-   Start := Clock;
-   MUL_4X4 (X, Y, Z, W, T, First, Last);
-   { IIT_MUL_4X4 (X, Y, Z, W, T, First, Last); }
-   Elapsed := Clock - Start;
-   WriteLn ('Number of vectors: ', Last-First+1);
-   WriteLn ('Time: ', Elapsed, ' ms');
-   WriteLn ('Equivalent to ', (28.0*(Last-First+1)/1e6)/
-            (Elapsed*1e-3):0:4, ' MFLOPS');
-   WriteLn;
-   WriteLn ('Last vector:');
-   WriteLn;
-   WriteLn (X^[Last]);
-   WriteLn (Y^[Last]);
-   WriteLn (Z^[Last]);
-   WriteLn (W^[Last]);
- END.
-
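
To cross-check the results of MUL_4x4 or IIT_MUL_4x4 without any assembly
code, the transformation of a single vector can be written in plain Pascal.
The sketch below uses the same test matrix as Trnsform and the vector
(1,1,1,1) that Trnsform stores at index 0, so each result component is simply
the sum of one matrix row. The program name RefMul and the type Vec4 are made
up for this example. Note that the assembly routines need only 16
multiplications and 12 additions per vector (the 28 operations used in the
MFLOPS figure), while this simple loop also adds the first product to zero.

```pascal
{$N+,E+}
PROGRAM RefMul;
{ Hypothetical reference: multiply the Trnsform test matrix by    }
{ the vector (1,1,1,1) using plain Pascal.                        }

TYPE Mat4 = ARRAY [1..4, 1..4] OF DOUBLE;
     Vec4 = ARRAY [1..4] OF DOUBLE;

VAR T: Mat4;
    V, R: Vec4;
    K, L: INTEGER;

BEGIN
  FOR K := 1 TO 4 DO
    FOR L := 1 TO 4 DO
      T [K, L] := (K-1)*4 + L;           { same test matrix as Trnsform }
  FOR L := 1 TO 4 DO
    V [L] := 1;                          { components x, y, z, w }
  FOR K := 1 TO 4 DO BEGIN
    R [K] := 0;
    FOR L := 1 TO 4 DO
      R [K] := R [K] + T [K, L] * V [L]; { always uses the old vector }
  END;
  FOR K := 1 TO 4 DO
    WriteLn (R [K]:8:1);                 { prints 10.0, 26.0, 42.0, 58.0 }
END.
```

After a run of Trnsform, the components of vector 0 should match these four
values.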