The current range of ARM-based computers (Archimedes A410/1, A420/1, A440/1, R140 and BBC A3000) all use the Acorn-designed ARM2 VLSI chip, a small, cheap and fast RISC microprocessor, running at a clock speed of 8MHz. In the laboratory ARM2 has run at 20MHz by using very expensive static RAM, but the requirements of the commercial market limit the current machines to running ARM2 at 4 or 8MHz with ordinary memory systems built from 120ns dynamic RAM.
ARM3 is designed to run at higher speeds while using standard memory. It can do this because of the typical behaviour of a program: instructions and data are re-used many times (for example in loops) in order to complete a task. By copying instructions and data into high speed memory as they are first encountered, the later references will run faster. The high speed memory where these copies are placed is called a cache (pronounced "cash").
ARM3 contains within it an enhanced ARM2 processor plus 4Kbytes of cache memory built onto the same silicon chip. This may not seem very much compared to the 4Mbytes of memory inside the A440, but it is enough to provide performance benefits on all programs because of the generally small nature of program loops. Because the cache is smaller than the system's main memory, it has to remember two things for each entry: the instruction or data itself, and the address it came from. This latter piece of information is referred to as the tag, just like the address tag (label) on a parcel.
When the processor requires information, the tags are checked to see if the information is in the cache: if it is, the associated data is provided, otherwise an external memory cycle is used. To reduce the size of the structures used by the tags, each tag guards four 32-bit words of data (so 256 tags are needed for the 4K cache on ARM3). The ARM3 thus reads four words into the cache even if the programmer specified just one byte. This is actually another advantage: the adjacent locations (which will probably be referred to subsequently in any case) have already been copied into the cache by the time the program gets round to using them. Four-word transfers are also advantageous with the MEMC memory controller: reading all four words takes only a little over twice as long as reading one. Each data write is checked against the entries in the cache, and if a match is found, the information in the cache is updated.
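The tag check and four-word line fill described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `ToyCache`, the dictionary-based tag store and the absence of any eviction are illustrative choices, not a description of how ARM3 organises its tags internally.

```python
LINE_BYTES = 16                        # four 32-bit words guarded by each tag
CACHE_BYTES = 4 * 1024                 # 4Kbytes of cache, as on ARM3
NUM_TAGS = CACHE_BYTES // LINE_BYTES   # number of tags needed


class ToyCache:
    """Hypothetical model: a tag is simply the address divided by 16."""

    def __init__(self, memory):
        self.memory = memory           # bytearray standing in for main memory
        self.lines = {}                # tag -> private copy of the 16-byte line
        self.hits = 0
        self.misses = 0

    def read_byte(self, address):
        tag, offset = divmod(address, LINE_BYTES)
        line = self.lines.get(tag)
        if line is None:
            # Miss: fetch all four words, even though one byte was requested.
            self.misses += 1
            base = tag * LINE_BYTES
            line = bytearray(self.memory[base:base + LINE_BYTES])
            self.lines[tag] = line
        else:
            self.hits += 1
        return line[offset]

    def write_byte(self, address, value):
        # The write always goes out to main memory...
        self.memory[address] = value
        # ...and the cached copy is updated only if a tag matches.
        tag, offset = divmod(address, LINE_BYTES)
        if tag in self.lines:
            self.lines[tag][offset] = value


memory = bytearray(range(64))
cache = ToyCache(memory)
first = cache.read_byte(5)                               # miss, fills bytes 0..15
neighbours = [cache.read_byte(a) for a in range(6, 16)]  # all served from the cache
```

After the single miss on address 5, the ten neighbouring reads are all hits, which is the benefit of fetching the adjacent locations in advance.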
An interesting part of the design is the way the position of a new entry in the cache is selected. As new information is copied from main memory to the cache, something currently in the cache must be overwritten with the new data. For ARM3 this is selected on a purely random basis! Simulations have shown that more intelligent algorithms (such as discarding the least recently used entry) do no better than a random one, and would sometimes do much worse.
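One way to see why a "smarter" policy can do much worse is to run a loop that is just slightly larger than the cache. The sketch below is a toy comparison, not the simulation Acorn ran: it assumes a tiny fully associative cache of four lines, hypothetical function names `run_lru` and `run_random`, and a seeded random generator so that the result is repeatable.

```python
import random
from collections import OrderedDict


def run_lru(accesses, capacity):
    """Count hits when the least recently used line is discarded."""
    cache = OrderedDict()
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
            cache.move_to_end(line)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict the least recently used line
            cache[line] = True
    return hits


def run_random(accesses, capacity, rng):
    """Count hits when a randomly chosen line is discarded."""
    cache = []
    hits = 0
    for line in accesses:
        if line in cache:
            hits += 1
        elif len(cache) < capacity:
            cache.append(line)              # still filling empty entries
        else:
            cache[rng.randrange(capacity)] = line
    return hits


CAPACITY = 4
LOOP = list(range(CAPACITY + 1)) * 200      # a loop one line larger than the cache

lru_hits = run_lru(LOOP, CAPACITY)
random_hits = run_random(LOOP, CAPACITY, random.Random(1))
print("LRU hits:", lru_hits, "random hits:", random_hits)
```

With this access pattern the least-recently-used policy always discards exactly the line that is needed next, so it never hits, while random replacement keeps part of the loop resident. When the loop fits in the cache, both policies behave identically.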
The cache holds virtually addressed data (since the translation of virtual addresses to physical addresses occurs inside the MEMC chip) and so needs an additional state to deal with information that supervisor mode can read but user mode cannot. It must also be cleared when the operating system, whether RISC OS or RISC iX (on the R140 UNIX machine), switches to a new task. And it must avoid caching values which change by themselves (e.g. in the input/output system).
For efficiency it is also a good idea for the chip to be capable of not caching various things: for example the screen memory is a poor subject for caching, being written first and read second (if at all), and being decidedly larger than the cache. And there are areas readable at two virtual addresses (like the screen and the physically addressed RAM), and even places where reads and writes are to different devices (e.g. read the ROM and write the MEMC control registers).
To support all of these miscellaneous facilities there are some cache control registers in ARM3 (which only the supervisor can change). These three registers split the memory into 2Mbyte sections providing 3 options:
1. Uncachable: set I/O areas, doubly mapped DRAM etc. to uncachable
2. Updatable: writes are copied to the cache
3. Disruptive: writes will invalidate the cache
By setting these control bits appropriately, RISC OS and RISC iX systems can be run totally compatibly on the ARM3 with no changes to the applications or the operating systems. When reset, ARM3 disables its cache, allowing these operating systems to be started as if ARM3 were an ARM2; a cache initialisation program can then be run to turn the cache on.
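The per-section control bits can be pictured as three bitmaps indexed by the top bits of the address. The sketch below is an illustration only: it assumes one bit per 2Mbyte section in each register and a 64Mbyte (26-bit) address space giving 32 sections, and the names `CacheControl`, `set_section`, `may_cache` and `on_write` are hypothetical rather than taken from any Acorn documentation.

```python
SECTION_SHIFT = 21     # 2Mbyte sections: 2**21 bytes each
NUM_SECTIONS = 32      # assumption: 64Mbyte address space / 2Mbyte


def section_of(address):
    """Return the index of the 2Mbyte section containing this address."""
    return address >> SECTION_SHIFT


class CacheControl:
    """Hypothetical model of the three control registers as bitmaps."""

    def __init__(self):
        # All bits clear: nothing is cachable until software says so.
        self.cachable = 0
        self.updatable = 0
        self.disruptive = 0

    def set_section(self, section, cachable=False, updatable=False,
                    disruptive=False):
        bit = 1 << section
        for name, on in (("cachable", cachable),
                         ("updatable", updatable),
                         ("disruptive", disruptive)):
            value = getattr(self, name)
            setattr(self, name, (value | bit) if on else (value & ~bit))

    def may_cache(self, address):
        return bool(self.cachable & (1 << section_of(address)))

    def on_write(self, address):
        """Say what a write to this address does to the cache."""
        bit = 1 << section_of(address)
        if self.disruptive & bit:
            return "invalidate"
        if (self.cachable & bit) and (self.updatable & bit):
            return "update"
        return "ignore"


control = CacheControl()
control.set_section(0, cachable=True, updatable=True)   # ordinary program memory
control.set_section(30, disruptive=True)                # illustrative section number
```

A cache initialisation program would fill in tables like these once, after the operating system has started with the cache disabled.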
In addition to "simply" going faster, ARM3 provides some new facilities. The first is to do with the co-processor bus: if all the accesses to memory are fulfilled by the cache, how does the co-processor (currently attached to the data bus between ARM2 and MEMC) function? ARM3 has a separate co-processor bus to which existing co-processors can be attached.
The second facility is more interesting: a single ARM3 uses only 1/4 to 1/2 of the data bus bandwidth provided by MEMC. If the video system isn't using it, why not add another ARM3? (Though it's not as simple as it sounds!) If multiple ARM3s are to be used, they need a secure method of communication which will return a valid answer even if all the processors try to alter the same location simultaneously. Ordinary LDR/STR communication is not secure: if one processor moves a value to a register, adds one to it and writes it back, another processor might have written information to the same location in the interim. ARM3 has a LOCK signal added to its bus interface and an instruction which uses it: SWP swaps a byte or word between register and memory, doing the load and store with LOCK asserted so that other processors will not gain access to the bus in between these operations. This enables a particular processor to set a flag (and be sure that the other processor has not simultaneously done the same thing) as an indication that a certain area of memory should not be accessed by the other processor.
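The flag-setting protocol that SWP makes possible can be sketched with threads standing in for processors. This is a simulation under stated assumptions, not ARM code: the class `SharedWord` and the functions `acquire`, `release` and `run` are hypothetical names, and a `threading.Lock` is used purely to model the LOCK signal making the load and store indivisible.

```python
import threading
import time


class SharedWord:
    """One word of shared memory; swap() stands in for SWP."""

    def __init__(self, value=0):
        self._value = value
        self._bus_lock = threading.Lock()   # models the LOCK signal on the bus

    def swap(self, new):
        # Load the old value and store the new one with no access in between.
        with self._bus_lock:
            old = self._value
            self._value = new
            return old


def acquire(flag):
    # Swap 1 into the flag; reading back 0 means this caller set it first.
    while flag.swap(1) == 1:
        time.sleep(0)                       # someone else holds it: try again


def release(flag):
    flag.swap(0)


def worker(flag, counter, repeats):
    for _ in range(repeats):
        acquire(flag)
        counter[0] += 1                     # the unsafe load, add one, store back
        release(flag)


def run(processors=4, repeats=200):
    flag = SharedWord(0)
    counter = [0]
    threads = [threading.Thread(target=worker, args=(flag, counter, repeats))
               for _ in range(processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]


total = run()
```

Because every increment happens while the flag is held, no update is lost and the final count equals the number of processors multiplied by the number of repeats.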
Process | 1.5um DLM CMOS |
Chip size | 8.72mm x 9.95mm |
No. Transistors (RAM/CAM) | 309,656 (206,454 / 62,973) |
Cache size | 4Kbytes |
Cache clock speed | 25MHz |
Performance (120ns DRAM main memory) | 25 MIPS peak, 12 MIPS sustained |
Power consumption | 1 Watt |
Package | 160 pin Quad Flat Pack |
No. Power pins | 41 |
Obviously, with all these extra signals (and others like the high speed clock which runs the processor core), ARM3 is not pin-compatible with ARM2! It has 160 pins (including 41 connected to the power lines) and is in a quad flat package with 40 leads a side. ARM3 is, however, signal compatible with the existing chip set so that it can use MEMC, VIDC, IOC and the FPPC (floating point protocol converter) in new designs. ARM3 has been tested on a 1.5 micron double level metal CMOS process, where its 309,656 transistors occupy an area of 8.72 by 9.95 mm. Chips run at between 20 and 30MHz, dissipating up to one Watt of power. However, the design will be shrunk to a 1 micron process before high volume production, and this reduction in size should allow an increase in maximum clock rate.
So how fast does it go? It's difficult to say since it depends on so many things: screen mode, ARM3 clock speed, program, operating system... For a 20MHz ARM3 running the Dhrystone 2.1 program under RISC OS in mode 0, the speed is increased by a factor of 3. But in higher screen modes like mode 20, the ARM3 hardly slows down at all, while the ARM2 runs at only 50-60% of its original speed, so the relative improvement is even greater. And if you spend a lot of time in mode 21... Conversely, a simple mode like mode 18 and a different program, Smalltalk Express' Smalltalk-80, gives only a 1.3 to 1.5 times improvement. In a graphical environment like RISC OS's Desktop, type faces, scrolling, panning and sprites all cache very well and result in a distinctly "light feeling" to all the objects: they move around very easily and quickly.
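The "even greater" improvement in mode 20 can be estimated from the figures quoted above. The arithmetic below is a rough sketch: it assumes that "hardly slows down at all" means no slowdown for the ARM3, and that the factor of 3 measured in mode 0 is the right starting point; neither assumption is a measured result.

```python
MODE0_SPEEDUP = 3.0            # Dhrystone 2.1, 20MHz ARM3 against ARM2, mode 0
ARM2_FRACTIONS = (0.5, 0.6)    # ARM2 speed in mode 20 relative to mode 0
ARM3_FRACTION = 1.0            # assumption: ARM3 loses nothing in mode 20


def relative_speedup(arm2_fraction, arm3_fraction=ARM3_FRACTION):
    """Estimated ARM3-to-ARM2 speed ratio in the slower screen mode."""
    return MODE0_SPEEDUP * arm3_fraction / arm2_fraction


high = relative_speedup(ARM2_FRACTIONS[0])   # ARM2 at 50% of its mode 0 speed
low = relative_speedup(ARM2_FRACTIONS[1])    # ARM2 at 60% of its mode 0 speed
print(f"estimated mode 20 speed-up: {low:.1f} to {high:.1f} times")
```

On these assumptions the factor of 3 becomes roughly 5 to 6 in mode 20.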
Although all programs running under RISC OS or RISC iX will run on ARM3, there are some things that can be done to optimise the performance on ARM3, usually without harming the performance on ARM2:
1. Don't unroll loops too many times. A common technique to speed up execution of loops is to "unroll" them, replicating the body of the loop several times and just having one end of loop statement. On an uncached processor, this will clearly speed things up since fewer statements get executed. On a cached processor the algorithm fills more cache space than necessary (thus discarding other useful information) and re-uses the items fewer times.
2. Don't use STR, STRB. On the cached processor, store instructions always halt the processor while they go out to memory. The normal three cycle cost of a store is, therefore, greatly magnified, becoming 6 or more cycles for a 20MHz processor, because of the relatively slow speed of off-chip RAM access. Collect store instructions together, replacing STRBs with STR, and STRs with STM. Load instructions also suffer, but at least they have the chance of being in the cache. I recently re-wrote the store floating point algorithm in BBC BASIC, which has to store 5 bytes in consecutive locations, to make it ARM3 friendly. Before the change it took 20 stores, 0 loads and 12 data operations to do four of them. Afterwards it took 10 stores, 2 loads and 32 data operations. The changed algorithm is slightly faster for an ARM2 and much faster on an ARM3.
3. Try to increase the locality of the code. Group the code by use: put together all the bits which operate at the same time. Use common sub-routines to concentrate critical sections into the cache.
Many similar techniques will occur to the proficient programmer!
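The advice to collect stores together can be illustrated by counting memory writes for four 5-byte values stored in consecutive locations. This sketch is not the BBC BASIC routine mentioned above (which ended up with 10 stores and 2 loads); it assumes a word-aligned destination and little-endian byte order, and `byte_stores` and `pack_words` are hypothetical helper names.

```python
def byte_stores(data):
    """Number of stores if every byte is written separately (one STRB each)."""
    return len(data)


def pack_words(data):
    """Pack bytes into 32-bit little-endian words, padding the tail with zeros."""
    padded = data + bytes(-len(data) % 4)
    return [int.from_bytes(padded[i:i + 4], "little")
            for i in range(0, len(padded), 4)]


# Four illustrative 5-byte values laid out one after another.
values = [bytes([n, n + 1, n + 2, n + 3, n + 4]) for n in (0, 10, 20, 30)]
data = b"".join(values)

separate = byte_stores(data)   # stores needed when writing byte by byte
words = pack_words(data)       # words to write with STR, or a single STM
print(separate, "byte stores become", len(words), "word stores")
```

The packing costs extra data operations (shifts and ORs in a real routine), which is the trade the text describes: more work inside the processor in exchange for fewer trips to slow external memory.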
Since ARM3 is relatively new there are only a few places where you can find out more. VLSI Technology Inc (Milton Keynes) will be selling the chip as VL86C020 and will have data manuals etc. Acorn and VLSI Technology Inc have provided papers to various conferences this year describing the implementation of the chip. A book describing RISC generally, and ARM2 and ARM3 in particular, is VLSI RISC Architecture and Organisation by S. Furber, published by Marcel Dekker, Inc, ISBN 0-8247-8151-1. The price is $119.50, unfortunately, but some of you must be able to buttonhole a friendly librarian!