The 80860 Microprocessor, recently introduced by Intel (see "The Intel 860," in the April issue of MIPS), is a high-performance RISC chip with integrated floating-point and 3D graphics units. In this two-part article, I'll examine the architectural details of the chip--the functional units, memory organization, the instruction set, and other topics.
The 860's large number of transistors--over a million on a 10- x 15-mm die--sets an integration record for single-chip processors. By comparison, the 80386 sports about 300,000 transistors, and the MIPS Computer Systems R2000 RISC processor uses fewer than 100,000. The number of transistors on the 860 allows the three key architectural elements of a workstation-class chip set--CPU, floating-point processor, and memory management--to be combined on one chip, making low-cost, high-function systems possible.
The other defining characteristic of the 860 is its high clock speed. It will initially be offered at 33 MHz and 40 MHz, with a 50-MHz version to follow next year. At typical RISC speeds of 12 to 20 MHz, processors like the SPARC and Motorola 88000 still outperform 33-MHz 80386s and 68030s. A solid RISC implementation such as the 860, running at CISC-like speeds of 33 MHz and up, can be expected to perform very well.
AN ARCHITECTURAL OVERVIEW
The 860 is made up of a single paging-based memory management unit, with instruction and data caches, serving both a 32-bit RISC processing core (known in RISC circles as the integer unit) and a fast 64-bit floating-point unit.
Figure 1 shows a simplified version of the data and instruction flow on the chip.
The external data bus is 64 bits wide, a first in widely available microprocessors. Internal buses for the integer unit are all 32 bits, and there are 32 integer unit registers, each 32 bits wide. The floating-point unit, on the other hand, has all 64-bit data paths, except for a 128-bit connection between the data cache and the floating-point registers. The floating-point registers can be accessed as thirty-two 32-bit registers, sixteen 64-bit registers, or eight 128-bit registers.
Address calculations are done in the integer unit and sent to a paging unit for page entry look-up and protection checking. The 860's 32-bit addressing (which matches that of the 80386 and the upcoming 80486 for coprocessing compatibility) allows access to up to 4 Gigabytes of data. Instructions are stored in a 4-Kbyte, read-only cache; data is stored in an 8-Kbyte write-back cache. Memory writes go to one of two 128-bit write buffers and are written to main memory only when there is no other bus traffic.
Note that a single 64-bit bus goes into the instruction cache, while two 32-bit buses come out, one each for the integer unit and floating-point unit. This allows the 860 to fetch two instructions, such as a loop control instruction and a floating-point instruction, in one fetch, then simultaneously feed one instruction each to the integer and floating-point units. In this dual-instruction mode, the instructions will execute in parallel.
The problem with dual-instruction mode is that the code must be arranged specially--an integer instruction paired with a floating-point instruction, aligned on a 64-bit boundary--at compile time. Dual-instruction mode will not benefit a user who happens to be running a word processor and a math calculation at the same time. The paired instruction technique will most likely be used during intensive floating-point calculations, so the integer unit can handle loop control simultaneously. There is also a floating-point dual-operation mode, in which the floating-point adder and multiplier both work at once, producing as many as two results per cycle. This mode will be more fully described in the section on floating-point processing in the second part of this article.
All 860 I/O is memory-mapped (i.e., there are no separate instructions such as IN and OUT for I/O access). In this area the 860 is not completely compatible with the 80386. It is the responsibility of software to ensure that memory accesses by I/O operations do not invalidate information that is stored in the data cache (instruction pages should generally be read-only).
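The pairing rule for dual-instruction mode can be sketched in a few lines of Python. This is only an illustration of the constraint described above, not Intel's decode logic: the function name and the "integer"/"float" labels are hypothetical, and the sketch assumes that either ordering within the pair is acceptable, which the description above does not settle.

```python
INSTRUCTION_BYTES = 4                 # two instructions arrive in one 64-bit fetch
PAIR_BYTES = 2 * INSTRUCTION_BYTES    # so a pair occupies 8 bytes


def can_dual_issue(pair_address, first_kind, second_kind):
    """Return True if two adjacent instructions could run as a dual pair.

    pair_address is the address of the first instruction of the pair;
    each kind is "integer" or "float" (hypothetical labels).
    """
    aligned = pair_address % PAIR_BYTES == 0                   # 64-bit boundary
    one_of_each = {first_kind, second_kind} == {"integer", "float"}
    return aligned and one_of_each
```

A compiler or assembler would have to guarantee both conditions when it lays out the code, which is why the mode cannot be exploited by unrelated programs that merely happen to be running at the same time.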
PERFORMANCE
There has been some confusion surrounding performance claims made by Intel for the 860. As a RISC chip executing almost one instruction per cycle, a 40-MHz 860 should run at close to 40 MIPS. Similarly, the fast floating-point unit, running at 40 MHz, should be able to deliver between about 20 and 40 MFLOPS (millions of floating-point operations per second), depending on whether it needs two cycles for a given operation, or only one. Yet Intel claims up to 80 MFLOPS for single-precision floating-point operation and up to 120 "MOPS" (millions of operations per second) overall throughput.
The 80 MFLOPS performance occurs when using pipelined instructions in dual-operation mode, with both the adder and the multiplier running at once. With careful programming, this should actually occur during certain intensive calculations such as matrix manipulation and three-dimensional graphics. The 120 MOPS figure is obtained by adding the 40 MIPS that the integer unit can theoretically produce to the 80 MFLOPS figure, as may occur if bookkeeping for a loop runs in dual-instruction mode alongside dual-operation floating-point work. These high figures are for ideal conditions only, and they do not consider slowdowns caused by cache misses, branches that have not been fully optimized, and page swapping to and from disk.
Also, peak-performance figures apply only to pipelined floating-point operations and not to standard business-type calculations, business graphics, and so on. For these and other nonspecialized purposes, a 40-MHz 860 is simply a RISC processor with 30+ MIPS of integer performance and 30+ MFLOPS of floating-point speed. Compared to even a fast 25-MHz 80386 with about 7 MIPS of integer performance and little floating-point speed, this is not bad.
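The peak figures come down to simple per-cycle arithmetic, reproduced here as a short Python sketch. The function name is hypothetical, and the per-cycle rates are the ideal ones quoted in the text (one integer instruction, plus one adder result and one multiplier result, every cycle).

```python
def peak_rates(clock_mhz):
    """Ideal peak throughput, in millions per second, for a given clock."""
    integer_mips = clock_mhz * 1    # one integer instruction per cycle
    peak_mflops = clock_mhz * 2     # adder and multiplier each finish one result
    return {
        "MIPS": integer_mips,
        "MFLOPS": peak_mflops,
        "MOPS": integer_mips + peak_mflops,
    }


rates = peak_rates(40)   # the 40 MHz part discussed in the text
```

None of the real-world penalties (cache misses, unoptimized branches, paging) appear in this calculation, which is exactly why the numbers should be read as ceilings.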
MEMORY ORGANIZATION
One of the keys to improving performance for fast 80386-based and other machines has been the use of off-chip static RAM caches to reduce the number of wait states for a typical memory access. The 860 uses onboard caching and built-in support for static column RAM to offer high performance without the expense of an off-chip cache.
Typical off-chip cache sizes on popular microcomputers range from the 32-Kbyte write-through cache supported by the Intel 82385 up to the 256-Kbyte write-back cache available with the Everex Step series. The Motorola 88000 chip set usually includes two 16-Kbyte 88200 memory management chips, one used to cache instructions, the other for data. By these standards, the Intel chip's onboard caches are small -- 8 Kbytes for data, 4 Kbytes for instructions. (Note that provision has been made in the epsr register for larger on-chip data caches, as described below.) However, the 860 processor's onboard caches are much larger than other caches integrated on current CISC CPUs, such as the two 256-byte caches found on the Motorola 68030.
The 860's caches are highly integrated into the chip's functions. The data cache is a "write-back" cache; that is, when a value in the cache is updated, main memory is not updated at the same time. The instruction cache, on the other hand, is read-only and must be flushed if code stored in main memory is changed by self-modifying code, or when a new page is read in from disk over a page with values stored in the cache. Another important cache is the Translation Lookaside Buffer (TLB), used for paging support, which operates the same way on the 860 as on the 80386.
Intel has claimed a peak transfer rate of 1 Gigabyte/second from the on-chip caches, at 40 MHz. This can occur, but only in specific circumstances. The instruction cache is 64 bits wide, so it can hold both a floating-point and an integer instruction for simultaneous execution in dual-instruction mode. The data cache is 128 bits wide to support dual 64-bit transfers to the floating-point registers. Both caches have a line size of 32 bytes -- that is, a cache miss causes the processor to quickly read not only the needed value but all the bytes in the same 32-byte line. Both caches can also function at once, simultaneously providing 320 Mbytes/second from the instruction cache (dual-instruction mode) and 640 Mbytes/second from the data cache (double-precision operations) for a peak internal transfer rate of nearly 1 Gigabyte/second.
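The write-back behavior is easy to see in a toy model. The Python class below is a deliberately simplified sketch and not a description of the 860's hardware: it caches single addresses rather than lines, never evicts anything, and all names are hypothetical.

```python
class WriteBackCache:
    """Toy write-back cache: writes stay in the cache until flushed."""

    def __init__(self, memory):
        self.memory = memory   # backing store: dict of address -> value
        self.lines = {}        # cached copies: address -> value
        self.dirty = set()     # addresses changed since the last flush

    def read(self, address):
        if address not in self.lines:            # miss: fetch from memory
            self.lines[address] = self.memory[address]
        return self.lines[address]

    def write(self, address, value):
        self.lines[address] = value              # main memory is NOT updated here
        self.dirty.add(address)

    def flush(self):
        for address in self.dirty:               # copy modified values back
            self.memory[address] = self.lines[address]
        self.dirty.clear()


memory = {0x100: 1}
cache = WriteBackCache(memory)
cache.write(0x100, 7)
stale_value = memory[0x100]    # memory still holds the old value at this point
cache.flush()
```

Between the write and the flush, main memory and the cache disagree, which is the window in which software has to be careful.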
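The transfer-rate claim can be checked with a few lines of arithmetic. This Python sketch assumes one full-width transfer from each cache on every clock cycle, which is the ideal case described above; the function name is hypothetical.

```python
def cache_mbytes_per_second(clock_mhz, width_bits):
    """Peak rate for a cache that delivers width_bits every clock cycle."""
    return clock_mhz * width_bits // 8   # bits -> bytes


instruction_rate = cache_mbytes_per_second(40, 64)    # 64-bit instruction path
data_rate = cache_mbytes_per_second(40, 128)          # 128-bit data path
combined_rate = instruction_rate + data_rate          # both caches at once
```

The combined figure comes to 960 Mbytes/second, which is where the "nearly 1 Gigabyte/second" claim originates.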
To support paged memory, most systems must put a comparator on the address bus to compare subsequent accesses, determine if they fall on the same RAM page, and signal the bus if they do. On the 860, this comparator is built in, and the Next Near (NENE#) pin is asserted on paged accesses. Positions in the directory base register indicate the current system's RAM page size and help determine when NENE# should be asserted. In most cases, the combination of on-board caching and static column RAM support should greatly reduce the benefit of adding an off-chip cache.
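The comparison that drives NENE# amounts to asking whether two successive addresses fall in the same RAM page. The following Python sketch shows the idea only; the function name and the 2-Kbyte page size in the usage lines are assumptions for illustration, since the actual page size is whatever the system's RAM dictates.

```python
def same_ram_page(previous_address, address, page_bytes):
    """True if both addresses fall in the same RAM page of page_bytes bytes."""
    return previous_address // page_bytes == address // page_bytes


PAGE_BYTES = 2048   # hypothetical RAM page size, for illustration only
hit = same_ram_page(0x2000, 0x27FC, PAGE_BYTES)     # same 2-Kbyte page
miss = same_ram_page(0x27FC, 0x2800, PAGE_BYTES)    # crosses into the next page
```

When the comparison succeeds, static column RAM can skip part of its access cycle, which is where the speed advantage comes from.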
GENERAL-PURPOSE REGISTERS
Compared to a general-purpose processor such as the 80386, there are several simplifying factors that reduce register complexity. First, load/store architecture eliminates the requirement that registers serve special functions for memory addressing. In addition, since a flat 4-Gigabyte address space is used, segment registers and segment descriptors are not required. Finally, the hardware-supported debugging capabilities of the 860 are less robust than for the 80386 and require only a single register.
The term "general-purpose register" is much more descriptive of the 860 than of CISC architectures. Only r0 among the integer registers and f0 and f1 in the floating-point register file are special: They always return 0 when read and ignore any values stored in them. The integer registers are used for address and scalar (non-floating-point) integer computations. The floating-point registers can be used as 32-bit, 64-bit, or even 128-bit values; specific instructions are used to access the larger data items, and the access is made to an even-numbered floating-point register (64-bit values) or to a register number evenly divisible by 4 (128-bit values). The 128-bit path between the data cache and the floating-point registers further supports 128-bit access.
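The register-numbering rule for wide floating-point operands can be written down directly. The Python sketch below is illustrative; the function names are hypothetical, and it encodes only the alignment rule stated above for the thirty-two 32-bit floating-point registers.

```python
FP_REGISTER_COUNT = 32   # f0 through f31, each 32 bits wide


def valid_fp_operand(register_number, width_bits):
    """Check whether a register number may start an operand of this width."""
    if not 0 <= register_number < FP_REGISTER_COUNT:
        return False
    registers_spanned = width_bits // 32          # 1, 2, or 4 registers
    if registers_spanned not in (1, 2, 4):
        return False
    return register_number % registers_spanned == 0


def operand_slots(width_bits):
    """How many operands of this width the register file can hold."""
    return FP_REGISTER_COUNT * 32 // width_bits
```

For example, f6 can start a 64-bit value but not a 128-bit one, while f8 can start either.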
CONTROL REGISTERS
The 860 has six 32-bit control registers and four special-purpose floating-point registers, each of which is used for a specific purpose.
The processor status register (psr) contains information specific to the current process and is described in detail below (see Figure 2).
The extended processor status register (epsr) has some information relating to the current process as well as identifying information for the chip itself, and is described below (see Figure 2).
The data breakpoint register (db) is used for debugging; it contains an address that will trigger a trap if accessed.
Dirbase, the directory base register, determines option settings for address translation, caching, and bus access.
The fault instruction register (fir) holds the address of the instruction that triggered a trap.
The floating-point status register (fsr) describes the current state of the floating-point unit.
Special-purpose registers (KR, KI, T, and MERGE) support floating-point dual-operation mode and graphics manipulation operations; these registers and fsr will be described in the second part of this article.
The processor status register has several fields of specific interest in comparing the 860 to CISC chips.
BR (Break Read) and BW (Break Write) enable or disable trapping of accesses to the address stored in the data breakpoint register (db), on reads and writes, respectively.
The CC (Condition Code) and IM (Interrupt Mode) fields resemble similar fields on other chips; however, there is only one general condition code flag and one loop flag (LCC), which have different meanings depending on the last instruction executed.
UM (User Mode) restricts access to the control registers.
Other single-bit flags are used to save current status when a trap occurs.
SC stores a shift count for use with the double-shift instruction.
PS and PM define the pixel size and pixel mask for use with graphics processing and will be described in the second part of this article.
The extended processor status register contains several fields that will allow a program to determine the exact member of the future 860 family of chips it is running on.
Processor type allows for any of 256 different processors to be specified; this field may be found in other future Intel processors.
Stepping number gives the revision number of the current processor type.
DCS (Data Cache Size) is a 4-bit field that specifies the size of the on-chip data cache; a DCS of 0 specifies a 4-Kbyte cache, a 1 indicates an 8-Kbyte cache (as found on the current 860), and each further increment doubles the size, so that a value of 4 would indicate a 64-Kbyte on-chip cache.
BE (Big Endian) specifies the order of bytes within a data item. Normally the first byte in an address is the low-order byte. When BE is set, the first byte is the "big end" of the byte string and is the high order byte in the data item. This allows transparent accessing of data from computers that use big endian format, such as IBM mainframes.
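Two of these fields lend themselves to a quick illustration. The Python sketch below assumes that each step in DCS doubles the cache size starting from 4 Kbytes, and it shows how the same 32-bit value is laid out in memory under each byte order. The function names are hypothetical, and nothing here reads a real epsr.

```python
def data_cache_bytes(dcs):
    """Cache size implied by a DCS value, assuming 4 Kbytes doubled dcs times."""
    if not 0 <= dcs <= 15:                 # DCS is a 4-bit field
        raise ValueError("DCS must fit in 4 bits")
    return 4096 << dcs


def bytes_in_memory(value, big_endian):
    """Byte sequence, lowest address first, for a 32-bit value."""
    order = "big" if big_endian else "little"
    return value.to_bytes(4, order)


little = bytes_in_memory(0x12345678, big_endian=False)   # low-order byte first
big = bytes_in_memory(0x12345678, big_endian=True)       # high-order byte first
```

With BE clear, the byte at the lowest address is 0x78; with BE set, it is 0x12, matching the layout used by big endian machines.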
In next month's article, I will discuss the instruction set, the integer unit, the floating-point unit and its associated graphics area, multitasking and multiprocessing support and interrupt handling, among other areas.
Bud Smith is a MIPS technical editor.
FIGURE 1. 860 data and instruction flow
[Block diagram. The 32-bit address and the 64-bit memory bus connect to the Bus and Cache Control Unit, which feeds the Data Cache and the Instruction Cache. The caches serve the Integer Unit and the Floating-Point Unit, each of which connects to its own register file (Integer Registers and Floating-Point Registers).]
FIGURE 2. Processor status register (psr) and extended processor status register (epsr)
(* = can be changed only from supervisor level)
psr fields, in order from bit 0 to bit 31:
Break Read *
Break Write *
Condition Code
Loop Condition Code
Interrupt Mode *
Previous Interrupt Mode *
User Mode *
Previous User Mode *
Instruction Trap *
Interrupt *
Instruction Access Trap *
Data Access Trap *
Floating-Point Trap *
Delayed Switch *
Dual-Instruction Mode *
Kill Next Floating-Point Instruction *
Reserved *
Shift Count
Pixel Size
Pixel Mask
epsr fields legible in the figure:
Overflow Flag
Big Endian Mode *
Page-Table Bit Mode *
Data Cache Size
Reserved
Write-Protect Mode *
Interlock *
HARDWARE IMPLEMENTATIONS AND SOFTWARE SUPPORT
At the Intel 860 announcement in February, there were favorable signs for early software support and hardware implementations of the new processor.
The most important hardware news was IBM's demonstration of the MCA board called the Wizard, which was used as a coprocessor in a Model 80 to draw a Mandelbrot set. The Wizard board should be available later this year as a product for 386-based IBM PS/2s using the MCA bus. Also, 860-based coprocessor boards for the AT bus are expected from AST Research Inc., Number Nine Computer Corp., and Mercury Computer Systems. MegaTek Corp., of San Diego, and Olivetti will be introducing systems incorporating the 860.
There were also promising developments on the software side. AT&T, Olivetti, Prime Computer, and Convergent Technologies will be working together to bring a multiprocessor version of the upcoming Unix System V, Release 4, to the 860. This special version of Unix will incorporate Mach features into an otherwise AT&T standard release. Also, the Wizard demonstration on the Model 80 was run under OS/2, which Microsoft officials have indicated will include 860 coprocessor support by next year.
Green Hills and MetaWare will be shipping compilers. Fortran and C are currently available from Green Hills, and MetaWare has High C in beta test, with Professional Pascal under development. Other companies announced in-circuit emulators, logic analyzers, and program optimizers. Ardent Computer has ported its Doré 3D visualization toolkit to the 860, claiming that 150,000 lines of C were moved over in only a few weeks.
Last, but certainly not least important, Intel is selling an AT-bus development system with a 25-MHz 386, an 860 coprocessor, several Mbytes of RAM, a large hard disk, tape drive, Ethernet, Unix, and development software. These systems are priced at $24,500 per unit; MCA-bus and OS/2 development systems are promised to follow soon.