Programmer 7500


Q&A Document | 1989-05-26 | 32.3 KB | 168 lines

May 26, 1989 4:01 pm Q&A WRITE -- B. Emerald

THE ARCHITECTURE OF THE INTEL 860
Part II

Last month, I discussed the overall architecture of Intel's newly announced 860 RISC processor and its caching capabilities, performance and register set. In this second half of the article I'll discuss the 860's instruction set as it relates to integer, floating-point and graphics operations and take a look at the processor's pinout and data types. I'll also examine implementation issues, such as multitasking and multiprocessing support, that affect how the 860 will be used in systems and conclude with a look at the immediate prospects for the chip.

THE INTEGER RISC CORE AND INSTRUCTIONS

The 860 actually has three instruction sets, corresponding to the three units on the chip -- the integer core, the floating-point unit and the graphics unit. Usually, only one unit will be running at any given time. In some cases during floating-point operation, an integer instruction can be run at the same time; this must be planned for by the programmer or by the compiler and requires careful implementation, as described below.

The integer core is a true RISC architecture and nearly all integer instructions take only one cycle -- an impressive feat for a chip running at 33 MHz. The instruction set is deliberately kept simple to support 1-clock execution. Load/store architecture, in which only explicit load and store operations access memory and other instructions operate on registers, is crucial to 1-clock performance. There are related strictures on programming. For example, an instruction should not use the destination register of the previous instruction, or a 1-clock delay will automatically result. (On some early RISC chips, the delay is not automatic and it's up to the compiler to rearrange code or insert a NOP to prevent problems.)
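The dependency rule above can be illustrated with a toy clock-count model. This is only a sketch: the tuple encoding of instructions and the name count_clocks are invented for illustration, and the model simply charges one extra clock whenever an instruction reads the register written by the instruction just before it. It makes no attempt to model the real pipeline.

```python
def count_clocks(program):
    """Count clocks for a list of (dest, src1, src2) register-name tuples.

    Toy model (an assumption, not the real chip): every instruction costs
    1 clock, plus 1 stall clock if it reads the destination register of
    the immediately preceding instruction.
    """
    clocks = 0
    prev_dest = None
    for dest, src1, src2 in program:
        clocks += 1
        if prev_dest is not None and prev_dest in (src1, src2):
            clocks += 1  # automatic 1-clock delay inserted by the hardware
        prev_dest = dest
    return clocks


# r3 is produced and consumed back to back: one stall.
dependent = [("r3", "r1", "r2"), ("r5", "r3", "r4"), ("r8", "r6", "r7")]
# Same work, with the independent instruction moved into the gap.
reordered = [("r3", "r1", "r2"), ("r8", "r6", "r7"), ("r5", "r3", "r4")]

print(count_clocks(dependent))  # 4
print(count_clocks(reordered))  # 3
```

Reordering of this kind is the job the text assigns to the programmer or compiler; on the 860 the penalty for missing it is a lost clock rather than a wrong result.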
Unlike instructions for CISC processors with their small number of registers, a RISC instruction such as an add takes three operands -- source one, source two and destination. (Adds for CISC processors typically put the result in one of the source registers.) The three-operand design keeps the processor from having to read and write the same register within a single clock.

On any architecture, it is difficult to get branch instructions to take fewer than 2 clocks. The traditional RISC solution to this problem is to have the processor execute the instruction below the branch, whether the branch is taken or not, and only then perform the flow-of-control change if required. (A smart compiler can "fill the hole" with a useful instruction about 75 percent of the time, according to Tom Pennello of MetaWare. A less smart compiler will always insert a NOP after a branch.) The Intel 860 can also execute the instruction only if the branch is taken and skip it otherwise (costing a clock); the programmer or compiler attempts to structure the code so that the instruction to be executed is on the most-used side of the branch.

A summary of the 860's instruction set is shown in Table 1. The processor also has additional mnemonics that are variants of the instructions listed in the table.

The 860's instruction set is complete by RISC standards, with a few exceptions. String moves (memory-to-memory transfers), though not common on RISC architectures, would be a welcome feature in a chip that will operate on large data areas for graphics and other functions. The 860 has no instructions to support fast register set saves or restores, needed for quick context switching. Such instructions are a main feature of the 80386 architecture.

Several standard instructions for integer arithmetic are carried out in the FPU instead of in the integer core.
This seems sensible but can incur overhead when values must be moved from integer to floating-point registers before the floating-point instruction and moved back to the integer registers afterward. There is no add or subtract with carry; no integer multiply; no integer divide, which is executed by software using the floating-point reciprocal function; and no integer compare. A compare is done by adding or subtracting two numbers, which sets or clears the CC bit, and then discarding the result by sending it to R0 (register 0 is always all 0's and ignores anything written to it). This procedure works but is somewhat opaque to a programmer reading someone else's code or examining compiler output.

In fact, such idiosyncrasies and the difficulty of implementing dual-instruction mode and pipelined floating-point operations (described last month) led Brian Case, writing in the March issue of Microprocessor Report, to conclude that "programming the i860 for maximum performance in assembler is not for the faint of heart." Optimizing compilers will be very important in taking full advantage of the chip.

Addressing modes for the 860 avoid the increasingly arcane (if occasionally useful) variety found on CISC chips in favor of four options: OFFSET, REGISTER, OFFSET+REGISTER and REGISTER+REGISTER. Since memory is accessed only through load and store instructions, understanding how a program interacts with memory and the cache will be much simpler than with CISC chips. The load and store instructions do, however, have auto-increment options for efficient vector-type operations and memory fills.

THE FLOATING-POINT UNIT

The 860's FPU can operate in several different ways. The most straightforward is single-instruction, scalar operation, in which the integer unit executes standard CPU instructions and the FPU executes most math instructions one at a time.
In this mode, adds and multiplies take 3 clocks (4 for double-precision multiplies), and the programmer or compiler writes fairly standard code. Because of the high clock speeds at which the 860 runs, the efficient RISC implementation, and the placement of the cache, CPU, and FPU all on one chip, program execution in this simple mode should be quite fast -- faster, for instance, than an 80486 running at the same clock speed (see the article in the J486MIP document).

Pipelining is the most important optimization that can be performed for floating-point instructions, since it cuts a 3 clock add or multiply to 1 clock. A double-precision, 4 clock multiply is cut to 2 clocks. During pipelined operation, a result from an old instruction will be ready at the same time the first operand for the newest instruction is being loaded from a register. So a pipelined instruction specifies two source operands for itself and a destination for the "old" result coming out of the pipeline. Each instruction counts on a subsequent instruction to store its result correctly up to three instructions later. Different operations have different pipeline lengths--an add, for instance, has a three-stage pipeline.

Because the pipeline must be cleared before use and results must be stored correctly during and after pipelined operations, pipelining is effective only for repetitive operations on large arrays of data, such as those found in geometric transformations. Successful implementation of pipelined operations does impose a burden on the programmer and, for high-level languages, the compiler.

When pipelining is in use, dual-operation instructions can also be used. These instructions run the FPU adder and multiplier units simultaneously. Normally, the adder and multiplier each require two source operands and a destination, for a total of six operands.
In dual-operation mode, only three operands are specified, and the KI and KR intermediate registers are used as additional inputs to the multiplier, with the T intermediate register holding a multiply result and an adder input. In dual-operation mode, the programmer needs to be concerned about making best use of the 8 Kbyte data cache, since cache misses will adversely affect performance. More complications can be added by, for instance, trying to mix single- and double-precision operands, which have different pipeline lengths--something the i860 Programmer's Reference Manual describes as "for the adventuresome." Successfully programming dual-operation instructions is a challenge, but the reward is that two floating-point results are yielded per clock. This optimization is worthwhile for large or repetitively processed blocks of data going through one or more transformations.

Another optimization is to execute integer instructions simultaneously with floating-point instructions by using the dual-instruction mode. Dual-instruction mode runs the CPU and FPU simultaneously. Dual-operation mode, available only for pipelined instructions, runs the FPU adder and FPU multiplier at the same time, as described above. Both "dual" modes can be in effect at once.

If the assembler directive .dual is encountered, or the corresponding instruction prefix is used, the subsequent instruction has its D bit set. After one more single instruction executes, the processor enters dual-instruction mode, in which 64 bits of code are loaded at once, with the first 32 bits going to the FPU and the remainder to the integer unit. This continues until the .enddual assembler directive is encountered or a floating-point instruction without the prefix is found, in which case the D bit is cleared. One more pair of instructions is executed together, then the processor returns to fetching 32 bits of code at a time, sending it either to the FPU or CPU, depending on the nature of the instruction.

TABLE 1.
Instructions for the 860's integer unit, floating-point unit and graphics unit.

Integer Unit
  Load integer                      8/16 bit, sign-extended 32 bit
  Store integer                     8/16/32 bit
  Load floating-point               32/64/128 bit
  Store floating-point              32/64/128 bit
  Store pixel                       8/16/32 bit pixel in 64 bit word
  Shift                             32 bit left, right, right arithmetic
  Shift                             64 bit right with 32 bit result
  Logical                           32 bit and/or/nor/and not
  Add/subtract                      32 bit signed/unsigned
  Transfer to floating-point reg.   32 bit
  Branch                            conditional delayed/not delayed
                                    unconditional delayed
  Compare and branch                not delayed, = or
  Loop                              update and test counter
  Flush                             cache
  Load/store control                to/from 32 bit control register
  Trap
  Call/return                       delayed transfer
  Call indirect
  Interrupt on overflow

Floating-Point Unit
  Add/subtract                      single/double
  Multiply                          single/double
  Compare                           single/double
  Transfer to internal register     32 bit
  Reciprocal divide step            single/double
  Reciprocal square root step       single/double
  Round integer                     single/double
  Truncate to integer               single/double
  Multiply low                      53x53 integer multiply with 53 bit result
  Add/subtract with multiply        single/double

Graphics Unit
  Add/subtract                      64 bit integer
  Or to merge                       64 bit integer
  Z-buffer add                      16/32 bit pixels
  Z-buffer check                    16/32 bit pixels
  Pixel                             8/16/32 pixels

GRAPHICS UNIT

The graphics unit operates on 8, 16 or 32 bit pixels, although all data is handled 64 bits at a time. The sidebar on the last page describes the 860's potential for use as a graphics processor.

The pixel size is determined by a 2 bit field in the Processor Status Register, Pixel Size (PS). PS values 00, 01 and 10 specify pixel sizes of 8, 16 and 32 bits; the value 11 is undefined. Pixel Mask (PM) is an 8 bit field that controls which pixels are actually stored by the 64 bit pixel-store instruction pst.d. If 8 bit pixels are used (PS=0), then each bit in PM specifies whether a byte-sized pixel is stored (bit = 1) or not (bit = 0). For 16 or 32 bit pixels, only the lower 4 or 2 bits, respectively, of PM are examined.
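The interaction of PS and PM in a masked pixel store can be sketched in a few lines of Python. This is a model of the behavior as described in the text, not of the actual hardware: the function name masked_pixel_store is invented, and the mapping of PM bit i to pixel i counted from the low end of the 64 bit word is an assumption.

```python
PIXEL_BITS = {0b00: 8, 0b01: 16, 0b10: 32}  # PS field -> pixel size in bits


def masked_pixel_store(old_word, new_word, ps, pm):
    """Return the 64 bit memory word after a masked pixel store.

    old_word: current 64 bit contents of memory
    new_word: 64 bit value being stored
    ps: 2 bit Pixel Size field; pm: 8 bit Pixel Mask field
    Assumption: PM bit i controls pixel i, counted from the low end.
    """
    if ps not in PIXEL_BITS:
        raise ValueError("PS value 11 is undefined")
    size = PIXEL_BITS[ps]
    count = 64 // size              # 8, 4 or 2 pixels per 64 bit word
    pixel_mask = (1 << size) - 1
    result = old_word
    for i in range(count):          # only the low `count` bits of PM matter
        if (pm >> i) & 1:
            shift = i * size
            result &= ~(pixel_mask << shift)            # clear the old pixel
            result |= new_word & (pixel_mask << shift)  # copy in the new one
    return result & 0xFFFFFFFFFFFFFFFF


new = 0x1122334455667788
print(hex(masked_pixel_store(0, new, 0b00, 0b00000101)))  # 0x660088
print(hex(masked_pixel_store(0, new, 0b01, 0b11110010)))  # 0x55660000
print(hex(masked_pixel_store(0, new, 0b10, 0b00000010)))  # 0x1122334400000000
```

In the second call the upper four PM bits are set but have no effect, since only the lower 4 bits are examined for 16 bit pixels.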
The PM field can also be set by use of some of the graphics functions. Figure 1 shows the available pixel formats.

Figure 1. The 860's pixel formats. The labels shown are examples only.
  8 bit pixel:   bit boundaries at 7, 5, 0
  16 bit pixel:  bit boundaries at 15, 9, 3, 0
  32 bit pixel:  bit boundaries at 31, 23, 15, 7, 0
  I = Intensity, R = Red, G = Green, B = Blue, C = Color, T = Tint

The graphics unit is supported by the FPU for very fast operation. Rotational transformations of 3D wireframes or solids are among the computation-intensive routines that can benefit from pipelined floating-point operations, as well as dual-operation and dual-instruction modes.

PINOUT

The 860 processor has 168 connecting pins, grouped as follows:

  64 data pins for the 64 bit data bus
  37 address pins--29 address-bus, 8 byte-enables--to support 32 bit addressing
  6 bus-interface pins, including bus-lock and next-near
  2 cache-interface pins, including one to disable caching
  6 execution-control pins--only one interrupt pin is used
  3 test pins for component and board testing
  2 configuration pins reserved by Intel
  24 power and 24 ground pins

The 64 data pins (D63-D0) are not independent but are controlled by eight 8 bit 74AS646 latching transceivers that can store data between one cycle and the next. A noncached data write, for instance, goes to the write buffer in the bus and cache control unit, is latched at the data pins, then goes out to the bus.

The 29 address pins (A31-A3) can select any 64 bit location; the 8 byte-enable pins (BE7#-BE0#) specify which bytes are accessed within the 64 bit location. Cache reads return a 64 bit quantity and ignore the byte-enable signals.

The bus-lock (LOCK#) pin supports the use of shared memory in multiprocessor systems by locking the bus from other processors while the asserting processor reads, modifies, then writes back the data. The bus-lock pin prevents the situation in which one processor reads a data item that is being modified by another.
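The split between the address pins and the byte-enable pins described above can be illustrated with a short sketch. The function name bus_signals is invented, and two details are assumptions rather than facts from this article: that BEn# selects the byte at offset n within the 64 bit location, and that the enables are active low (0 = byte accessed), as the # suffix conventionally indicates.

```python
def bus_signals(byte_address, size):
    """Return (A31-A3 value, BE7#-BE0# value) for one aligned access.

    byte_address: 32 bit byte address; size: 1, 2, 4 or 8 bytes.
    Assumptions: BEn# selects the byte at offset n within the 64 bit
    location, and the enables are active low (0 = byte accessed).
    """
    if not 0 <= byte_address < 2**32:
        raise ValueError("address must fit in 32 bits")
    if size not in (1, 2, 4, 8) or byte_address % size != 0:
        raise ValueError("access must be 1, 2, 4 or 8 bytes and aligned")
    location = byte_address >> 3            # 29 bit value on A31-A3
    offset = byte_address & 0b111           # byte position in the location
    active = ((1 << size) - 1) << offset    # 1 = byte accessed
    be_n = ~active & 0xFF                   # invert for active-low pins
    return location, be_n


print([hex(v) for v in bus_signals(0x1006, 2)])  # ['0x200', '0x3f']
print([hex(v) for v in bus_signals(0x1003, 1)])  # ['0x200', '0xf7']
print([hex(v) for v in bus_signals(0x1000, 8)])  # ['0x200', '0x0']
```

All three accesses fall in the same 64 bit location, so the address pins carry the same value and only the byte enables differ.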
The cache-enable pin (KEN#) is asserted if data being read is on a page that can be cached; information in shared memory can't be cached, as writes by other processors could render the cache stale. Pages used to support I/O also should not be cached unless the routine that handles the trap can maintain cache coherency (usually by flushing the cache).

The next-near (NENE#) pin supports the use of static-column RAMs. This pin is asserted whenever a memory access is on the same DRAM page as the previous access. The sidebar on the last page explains in some detail how the next-near pin works with RAM and the bus to support rapid sequential RAM accesses.

860 DATA TYPES

The 860's data types are similar to the 80386's, except that it has no 80 bit numeric type and has added 8, 16 and 32 bit pixel data types that can also be used to support nongraphics operations.

In the 860, 32 bit integers represent values from -2^31 to 2^31 - 1. 8 bit and 16 bit integers must be stored in a 32 bit register and sign-extended before being operated on. 32 bit ordinals (unsigned numbers) represent values between 0 and 2^32 - 1. 64 bit integers and ordinals are supported by add and subtract instructions; the programmer must write code to support long integer and long ordinal multiplies and divides.

The FPU supports IEEE-standard single-precision (32 bits) and double-precision (64 bits) reals. The high bit in each case is the sign bit; single-precision has an 8 bit exponent and a 23 bit fraction, while double-precision has an 11 bit exponent and a 52 bit fraction. A 64 bit double-precision real is stored with the low bits in an even-numbered register and the high bits in the next higher odd-numbered one. To implement divides, an on-chip reciprocal function must be combined with a routine in software.

MULTITASKING AND MULTIPROCESSING

The 860 is not designed as a multitasking chip in the same way as, for instance, the 80386.
There is no fast register save to quickly save the current processor status to memory, and the data cache must be flushed before a new process can be started; that is, "dirty" or written-to locations in the write-back cache must be saved to main memory. If a large number of cache locations have been written to since the last cache flush, this may take many cycles. During the register save and cache flush, interrupts must be disabled, increasing the likelihood that interrupts will stack up. In cases where the 860 functions as the CPU rather than as a math or graphics coprocessor, "smart" disk controllers, serial communications processors and so on may be required to reduce the number of interrupts the chip will have to service.

By virtue of its 386-compatible memory management and compatible data types, the 860 is well designed for a multiprocessing system that pairs it with an 80386. However, the 860 does not have bus snooping to preemptively spot updates to shared memory, nor does it have protection against self-modifying code like the 80486. There are no specific instructions for accessing shared memory in a single, protected operation, only the LOCK signal to prevent all outside access to the bus. The lack of these features makes the chip less well suited for "1" to "n" multiprocessing, in which multiple 860 chips gang up to tackle a problem.

THE OUTLOOK

Several conclusions can be drawn from statements made by various parties interested in the 860. The 860's support for paging (as used by Unix) but not for segmentation (as used by DOS and current versions of OS/2) implies that it will not be an effective CPU for "quick and dirty" ports of current IBM PC-type software. It will, however, be an excellent coprocessor for vector operations and for graphics applications currently found only on graphics supercomputers. The fact that the 860 has been shown running as a graphics coprocessor under a modified version of OS/2 demonstrates this capability.
The 860 chip is well suited for use as a stand-alone processor for a graphics workstation running Unix if incoming interrupts aren't excessive and compatibility with current IBM PC-type software is not a requirement. The upcoming multiprocessor version of Unix for the 80386, 80486 and 860 will support this type of application, and the 860-based workstation could be tied into an 80486-based file server that could run DOS applications and OS/2 LAN Manager across a network. Aggressive plans for developing a multiprocessor Unix are being pursued by a group of companies, including AT&T, Olivetti and Prime Computer.

Intel has been very helpful in providing MIPS with interviews and early access to information about the 860. However, at this writing there do not seem to be many chips in developers' hands. This indicates that coprocessor boards and systems using the 860 may not be generally available until next year--a conjecture we'd be happy to be proved wrong on.

By Bud Smith
Technical Editor
MIPS June 1989

THE 860 AS GRAPHICS PROCESSOR

General-purpose graphics processing is not the Intel 860's cup of tea. Its highly touted graphics capabilities are tightly focused on 2D and 3D modeling, as used in computer-aided manufacturing, molecular modeling, and animation.

Modeling requires intensive floating-point processing. In 3D, it is necessary to determine if a particular object to be drawn is already obscured by another object closer to the viewer. One technique that can be used to accomplish this is Z-buffering. The 860 has instructions to aid this process. The chip also has pipelined floating-point processing, which speeds calculations of 2D and 3D transformations, and commands to handle color interpolation for Gouraud shading of objects. Gouraud shading provides for more realistic coloring of 3D solids.
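The Z-buffering technique mentioned above can be sketched as follows. This illustrates the general idea only and is not a model of the 860's Z-buffer instructions: the function name z_buffer_check is invented, the group of four depth values reflects how many 16 bit values fit in one 64 bit word, and the convention that a smaller Z value means closer to the viewer is an assumption.

```python
def z_buffer_check(z_stored, z_new):
    """Compare a group of depth values position by position.

    Returns (merged_z, mask): merged_z keeps the nearer depth in each
    position, and mask has bit i set where the new pixel is visible.
    Assumption: a smaller Z value means closer to the viewer; on a tie
    the stored pixel is kept.
    """
    merged = []
    mask = 0
    for i, (old, new) in enumerate(zip(z_stored, z_new)):
        if new < old:
            merged.append(new)   # new object is in front: take its depth
            mask |= 1 << i       # and mark the pixel to be drawn
        else:
            merged.append(old)   # new object is hidden at this pixel
    return merged, mask


merged, mask = z_buffer_check([100, 50, 200, 75], [90, 60, 200, 10])
print(merged, bin(mask))  # [90, 50, 200, 10] 0b1001
```

A mask of this kind is what a masked pixel store needs in order to write only the visible pixels.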
Intel claims the 860 can render 50,000 Gouraud-shaded triangles per second, which is a phenomenal speed when you consider that the Matrox SM series of high-performance PC graphics boards achieves a rate of only about 20,000 Gouraud-shaded triangles per second. Some of the 860's speed in this application is due to the fact that the chip does all of its shading and Z-buffer operations 64 bits at a time. This means it can shade four 16 bit pixels or do four Z-buffer checks at a time.

Typical graphics processors, such as the TI 34010, TI 34020, Intel 82786 and IBM 8514/A, don't have built-in support for modeling, but it can be added through hardware external to the graphics processor to provide fast transformation calculations. Z-buffering and Gouraud shading can also be handled by software, either on the graphics chip or on the external processor. Naturally, the resulting rendering speed might be slowed by this approach.

But these processors do have some significant advantages over the 860 with respect to general-purpose graphics. All of them support hardware line drawing, rectangular fills, source/destination raster operations, clipping and BitBLTs (bit block transfers), as well as pixel sizes from 1 bit up to 16 or even 32 bits (in the case of the 34020). These operations are important for applications like desktop publishing, low-end CAD, presentation design and paint systems. The 860 does not support these operations in hardware and it is not necessarily easy to implement them in software. Also, the 860 supports only pixel sizes of 8, 16 or 32 bits per pixel, all of which can be effective in solid modeling but not in some other graphics applications. Additionally, the 860's larger pixel size means that greater amounts of memory are required for image display even in minimum configurations.

The graphics processors also have another advantage over the 860: a software base. This is understandable, since the 860 is only now becoming available.
But I think we'll see a surprising number of developers port their modeling software to an 860-based platform. General graphics applications might run at truly blazing speeds on a system using an 860 to do solids and endpoint generation and a lower-end processor to handle line drawing.

Another issue related to pixel sizes is that of displays. Of the graphics processors mentioned earlier, all support the use of VRAM (video RAM). VRAM is a form of dual-ported memory that allows a display processor to shift information out to the display without imposing the wait states that would occur with normal single-ported DRAM designs. The 860 does not explicitly support VRAM. Instead, it treats VRAM as regular DRAM and relies on an external display processing unit to manage the display and the shifting of data to the display. Fortunately, this is not expensive to do and it helps differentiate the 860 from typical graphics processors.

Another area in which the 860 differs from the norm is the definition of 16 bit pixels. The current PC graphics standards for 16 bit pixels dictate that the pixel is broken down into 5 bits of red, 5 of green, 5 of blue and 1 for other uses. The 16 bit pixel is traditionally used for video capture and scanning, because its 32,768 color levels cover just about every shade the eye can detect. For 16 bit pixels, the 860 processor uses 6 bits for red, 6 for green and 4 for blue, because the human eye's blue response is substantially lower than its red and green responses.

What all this boils down to is that while the 860 does not support the wide functionality of a normal graphics processor, it does excel at modeling due to its fast floating-point support and the additional transistors within it that are dedicated to graphics. According to Jack Grimes at Intel, for programs that take advantage of the graphics capabilities, modeling 3D solids is about ten times faster with the 860's internal graphics support than would be the case without it.
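The two 16 bit pixel layouts described above can be compared with a small packing sketch. The function names are invented for illustration, and the field order (red in the high bits, blue in the low bits) is an assumption; the article gives only the field widths.

```python
def pack_555(r, g, b):
    """Pack 5 bit red, green and blue; the top bit is left for other uses."""
    return (r & 0x1F) << 10 | (g & 0x1F) << 5 | (b & 0x1F)


def pack_664(r, g, b):
    """Pack 6 bit red, 6 bit green and 4 bit blue into 16 bits."""
    return (r & 0x3F) << 10 | (g & 0x3F) << 4 | (b & 0x0F)


def unpack_664(pixel):
    """Split a 6-6-4 pixel back into (red, green, blue)."""
    return (pixel >> 10) & 0x3F, (pixel >> 4) & 0x3F, pixel & 0x0F


COLORS_555 = 2 ** (5 + 5 + 5)   # 32,768 distinct colors
COLORS_664 = 2 ** (6 + 6 + 4)   # 65,536 distinct colors

print(hex(pack_555(31, 31, 31)))        # 0x7fff
print(hex(pack_664(63, 63, 15)))        # 0xffff
print(unpack_664(pack_664(40, 20, 9)))  # (40, 20, 9)
```

The 6-6-4 layout uses all 16 bits for color, doubling the number of representable colors while giving blue the coarsest resolution.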
By Jake Richter
President of Panacea, Inc.