NetNews Usenet Archive 1993 #3

Internet Message Format | 1993-01-28 | 4.7 KB

Path: sparky!uunet!charon.amdahl.com!amdahl!rtech!sgiblab!spool.mu.edu!agate!doc.ic.ac.uk!uknet!mucs!cs.man.ac.uk!endecotp
From: endecotp@cs.man.ac.uk (Phil Endecott)
Newsgroups: comp.sys.acorn.tech
Subject: Re: ARM risc speed?
Message-ID: <endecotp.728134671@cs.man.ac.uk>
Date: 27 Jan 93 11:37:51 GMT
References: <1993Jan20.151326.23097@infodev.cam.ac.uk> <1993Jan20.163935.29452@dcs.warwick.ac.uk> <1993Jan21.114022.5930@cs.nott.ac.uk> <2804@eagle.ukc.ac.uk>
Sender: news@cs.man.ac.uk
Lines: 93

spt1@ukc.ac.uk (S.P.Thomas) writes:

>In article <1993Jan21.114022.5930@cs.nott.ac.uk> smb@cs.nott.ac.uk (Simon Burrows) writes:
>>A while ago Acorn sent out some guidelines on code sequences which should
>>no longer be used if compatibility with future processors is to be maximised.
>>They were issued before the ARM250 was brought out, so probably apply to that?

>I am very interested in this,
>as I've built an, um "experimental" lazy functional language compiler
>that generates ARM assembler as its target code.

You probably don't have anything to worry about. Presumably your code
runs only in user mode; the principal restriction is about accessing
banked registers after a mode change, and this doesn't affect user mode
code.

>In a similar vain, one of the optimisations I do is to cause the
>code to be generated such that the starting addresses of the
>code sequences that can be jumped to are aligned to have an address
>of &XXXXXX4. This maximises the number of consequtive (sp?) sequential
>memory cycles when using MEMC1a. I have three main questions.

As well as reducing the number of non-sequential MEMC accesses, this
will improve the performance of the ARM3's cache, as cache lines are
aligned quadwords. If you jump to a non-aligned location, the cache
will be loaded with the preceding words from the alignment boundary.
If these don't contain useful code you are wasting that part of the
cache.
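As a rough illustration of the cache point above, here is a minimal Python sketch. It assumes only what the text states, that a cache line is an aligned quadword (four 32-bit words, 16 bytes). The function names are hypothetical, and the sketch models the cache line only, not the &XXXXXX4 rule used for MEMC1a.

```python
LINE_BYTES = 16   # one aligned quadword: four 32-bit words (assumption from the text)
WORD_BYTES = 4

def wasted_words(target):
    """Words loaded into the cache line ahead of a branch target.

    These are wasted only if they do not hold useful code.
    """
    return (target % LINE_BYTES) // WORD_BYTES

def align_up(addr):
    """Next quadword-aligned address at or after addr."""
    return (addr + LINE_BYTES - 1) & ~(LINE_BYTES - 1)
```

For instance, a branch target at &800C drags in the three words before it, while one at &8000 drags in none.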
The DEC Alpha architecture handbook has an interesting diagram showing
that half of the code brought into the VAX cache when running LINPACK
is never executed for this reason. They suggest that as well as
aligning branch targets, rarely executed code should be put
out-of-line. They indicate that the compiler should be able to profile
its code to determine which half of each if-then-else construct is the
more frequently executed part; this also helps to reduce the number of
branches that a program encounters. Could your functional language
compiler do this?

>1) The effect of this alignment is quite significant on an ARM2
>machine. Would the effect be as significant on an ARM3 (or ARM250,
>although I suspect the answer is yes, in this case)?

Yes on the ARM3; probably more so than on the ARM2 for the reason
mentioned above. The ARM250 will behave exactly like the ARM2+MEMC.

>2) In the next generation of memory controllers, are different rules
>likely to apply? If so (I suspect this is highly likely), can anyone
>give me an idea what they might be?

No-one knows, but the variables that can be adjusted include the
sequential burst length and the cache line length.

>3) Is it possible that certain orderings of instructions are "better"
>than others (ie, faster), even though they achieve the same effect? For
>example
>        ADR ad1,blk1                      ADR ad1,blk1
>        LDMIA ad1,{r0-r7}      compared   ADR ad2,blk2
>        ADR ad1,blk2             with     LDMIA ad1,{r0-r7}
>        STMIA ad1,{r0-r7}                 STMIA ad2,{r0-r7}

On all the current ARMs, these instruction sequences will take exactly
the same time. The ARM pipeline is quite simple, with a single-stage
execute, and even dependencies between adjacent instructions don't slow
it down. However, if in the future an ARM were built with a more
sophisticated pipeline, like the Alpha or the MIPS chips have, or with a
Harvard architecture (separate instruction and data memory ports), then
sequences that avoid dependencies would run faster.
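To make "dependencies between adjacent instructions" concrete, here is a small Python sketch that counts them for the two quoted sequences. The model is my own assumption, not anything from an ARM manual: each instruction is a hypothetical (text, registers read, registers written) tuple, ADR's implicit use of the PC is ignored, and LDMIA/STMIA without writeback leave the base register unchanged. It says nothing about timing on any real chip.

```python
R0_R7 = {f"r{i}" for i in range(8)}

# (instruction text, registers read, registers written) -- illustrative model only
LEFT = [
    ("ADR ad1,blk1",      set(),           {"ad1"}),
    ("LDMIA ad1,{r0-r7}", {"ad1"},         R0_R7),
    ("ADR ad1,blk2",      set(),           {"ad1"}),
    ("STMIA ad1,{r0-r7}", {"ad1"} | R0_R7, set()),
]
RIGHT = [
    ("ADR ad1,blk1",      set(),           {"ad1"}),
    ("ADR ad2,blk2",      set(),           {"ad2"}),
    ("LDMIA ad1,{r0-r7}", {"ad1"},         R0_R7),
    ("STMIA ad2,{r0-r7}", {"ad2"} | R0_R7, set()),
]

def adjacent_dependencies(seq):
    """Count instructions that read a register written by the one just before."""
    count = 0
    for (_, _, written), (_, read, _) in zip(seq, seq[1:]):
        if written & read:
            count += 1
    return count
```

Under this model the left-hand sequence has two adjacent dependencies (each ADR feeds the next instruction) and the right-hand one has one (the LDMIA feeds the STMIA).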
On the ARM600 and 610, the write buffer does have an impact on the
speed of some sequences. For example:

        ADR ad1,blk1                      ADR ad1,blk1
        STMIA ad1,{r0-r15}                ADR ad2,blk2
        ADR ad2,blk2                      STMIA ad1,{r0-r15}
        STMIA ad2,{r0-r15}                STMIA ad2,{r0-r15}

Although the write buffer very rarely fills in 'typical' code, if you
tried to run either of these code sequences it would fill up, because
you are trying to transfer 32 registers in 4 instructions. You can
reduce the effect of this by spreading out the STMs as far as possible;
the left hand example will run slightly faster.

>These sort of sequences occur a great deal in the code I'm generating.
>If there is a difference, what are the rules?

There is no effect on the ARM2, ARM3 or ARM250.

>Keep well,
>Stephen Thomas

--Phil.