- Path: sparky!uunet!charon.amdahl.com!amdahl!rtech!sgiblab!spool.mu.edu!agate!doc.ic.ac.uk!uknet!mucs!cs.man.ac.uk!endecotp
- From: endecotp@cs.man.ac.uk (Phil Endecott)
- Newsgroups: comp.sys.acorn.tech
- Subject: Re: ARM risc speed?
- Message-ID: <endecotp.728134671@cs.man.ac.uk>
- Date: 27 Jan 93 11:37:51 GMT
- References: <1993Jan20.151326.23097@infodev.cam.ac.uk> <1993Jan20.163935.29452@dcs.warwick.ac.uk> <1993Jan21.114022.5930@cs.nott.ac.uk> <2804@eagle.ukc.ac.uk>
- Sender: news@cs.man.ac.uk
- Lines: 93
-
- spt1@ukc.ac.uk (S.P.Thomas) writes:
-
- >In article <1993Jan21.114022.5930@cs.nott.ac.uk> smb@cs.nott.ac.uk (Simon Burrows) writes:
- >>A while ago Acorn sent out some guidelines on code sequences which should
- >>no longer be used if compatibility with future processors is to be maximised.
- >>They were issued before the ARM250 was brought out, so probably apply to that?
-
- >I am very interested in this,
- >as I've built an, um "experimental" lazy functional language compiler
- >that generates ARM assembler as its target code.
-
- You probably don't have anything to worry about. Presumably your code runs
- only in user mode; the principal restriction is about accessing banked
- registers after a mode change, and this doesn't affect user mode code.
-
- >In a similar vein, one of the optimisations I do is to cause the
- >code to be generated such that the starting addresses of the
- >code sequences that can be jumped to are aligned to have an address
- >of &XXXXXX4. This maximises the number of consequtive (sp?) sequential
- >memory cycles when using MEMC1a. I have three main questions.
-
- As well as reducing the number of non-sequential MEMC accesses, this will
- improve the performance of the ARM3's cache, as cache lines are aligned
- quadwords. If you jump to a non-aligned location, the cache will be loaded
- with the preceding words from the alignment boundary. If these don't
- contain useful code you are wasting that part of the cache.
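To put a number on the cache point above, here is a minimal sketch that counts how many words of a line are fetched ahead of a branch target. It assumes the 16-byte (quadword) line described above; the function names and the example addresses are made up for illustration.

```python
LINE_BYTES = 16   # one aligned quadword, as described in the post
WORD_BYTES = 4    # one ARM instruction

def wasted_words(target):
    """Words loaded into the cache line that sit before `target`."""
    return (target % LINE_BYTES) // WORD_BYTES

def align_up(addr):
    """Next line-aligned address at or above `addr`."""
    return (addr + LINE_BYTES - 1) & ~(LINE_BYTES - 1)

# Hypothetical branch targets, chosen only for illustration.
for target in (0x8000, 0x8004, 0x800C):
    print(hex(target), wasted_words(target), hex(align_up(target)))
```

The words counted here are only wasted if they hold code that is not executed anyway, for example the tail of a different routine.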
-
- The DEC Alpha architecture handbook has an interesting diagram showing
- that half of the code brought into the VAX cache when running LINPACK is
- never executed for this reason. They suggest that as well as aligning
- branch targets, rarely executed code should be put out-of-line. They
- indicate that the compiler should be able to profile its code to determine
- which half of each if-then-else construct is the more frequently executed
- part; this also helps to reduce the number of branches that a program
- encounters. Could your functional language compiler do this?
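As a rough sketch of the profile-driven choice described above: given execution counts for the two arms of an if-then-else, keep the hotter arm on the fall-through path and move the colder one out of line. The function name and the counts are hypothetical; a real compiler would get the counts from a profiling run.

```python
def layout_if_else(then_count, else_count):
    """Return (fall_through_arm, out_of_line_arm) from profile counts."""
    if then_count >= else_count:
        return ("then", "else")
    return ("else", "then")

# Hypothetical profile: the 'then' arm ran 900 times, the 'else' arm 100.
inline_arm, cold_arm = layout_if_else(900, 100)
print("fall through into:", inline_arm, "| branch out of line to:", cold_arm)
```

With the hot arm falling through, the common path executes no taken branch and the cold code does not occupy the same cache lines.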
-
- >1) The effect of this alignment is quite significant on an ARM2
- >machine. Would the effect be as significant on an ARM3 (or ARM250,
- >although I suspect the answer is yes, in this case)?
-
- Yes on the ARM3; probably more so than on the ARM2 for the reason mentioned
- above. The ARM250 will behave exactly like an ARM2 with MEMC.
-
- >2) In the next generation of memory controllers, are different rules
- >likely to apply? If so (I suspect this is highly likely), can anyone
- >give me an idea what they might be?
-
- No-one knows, but the variables that can be adjusted include the sequential
- burst length and the cache line length.
-
- >3) Is it possible that certain orderings of instructions are "better"
- >than others (ie, faster), even though they achieve the same effect? For
- >example
-
- >     ADR   ad1,blk1                   ADR   ad1,blk1
- >     LDMIA ad1,{r0-r7}    compared    ADR   ad2,blk2
- >     ADR   ad1,blk2         with      LDMIA ad1,{r0-r7}
- >     STMIA ad1,{r0-r7}                STMIA ad2,{r0-r7}
-
- On all the current ARMs, these instruction sequences will take exactly the
- same time. The ARM pipeline is quite simple, with a single-stage execute,
- and even dependencies between adjacent instructions don't slow it down.
- However, if in the future an ARM were built with a more sophisticated
- pipeline, like those of the Alpha or the MIPS chips, or with a Harvard
- architecture (separate instruction and data memory ports), then sequences
- that avoid dependencies would run faster.
-
- On the ARM600 and 610, the write buffer does have an impact on the speed of
- some sequences. For example:
-
-     ADR   ad1,blk1                 ADR   ad1,blk1
-     STMIA ad1,{r0-r15}             ADR   ad2,blk2
-     ADR   ad2,blk2                 STMIA ad1,{r0-r15}
-     STMIA ad2,{r0-r15}             STMIA ad2,{r0-r15}
-
- Although the write buffer very rarely fills in 'typical' code, it would
- fill up if you ran either of these code sequences, because you are
- trying to transfer 32 registers in 4 instructions. You can reduce the
- effect of this by spreading out the STMs as far as possible; the left-hand
- example will run slightly faster.
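A toy model shows why spacing the STMs helps. This is only a sketch: the buffer depth (8 words) and the cost of each memory write (2 cycles) are made-up parameters, not ARM600/610 figures, and each STM is treated as nothing more than 16 one-word stores. ADR is modelled as a one-cycle "alu" op.

```python
def cycles_to_issue(ops, capacity=8, write_cycles=2):
    """Cycles until the CPU has issued every op (hypothetical model)."""
    occupied = 0   # words currently held in the write buffer
    busy = 0       # cycles remaining on the memory write in progress
    cycle = 0
    i = 0
    while i < len(ops):
        cycle += 1
        # Memory side: start writing the oldest word if the bus is idle.
        if busy == 0 and occupied > 0:
            busy = write_cycles
        if busy > 0:
            busy -= 1
            if busy == 0:
                occupied -= 1   # write finished, slot freed
        # CPU side: ALU ops always issue; a store word needs a free slot.
        if ops[i] == "alu":
            i += 1
        elif occupied < capacity:
            occupied += 1
            i += 1
        # otherwise the buffer is full and the CPU stalls this cycle
    return cycle

STM16 = ["store"] * 16
spaced = ["alu"] + STM16 + ["alu"] + STM16       # left-hand sequence
back_to_back = ["alu", "alu"] + STM16 + STM16    # right-hand sequence

print(cycles_to_issue(spaced), cycles_to_issue(back_to_back))
```

In this model the ADR between the two STMs executes while the buffer is draining, so it costs nothing extra; in the right-hand ordering both ADRs run while the buffer is empty and that overlap is lost.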
-
-
- >These sort of sequences occur a great deal in the code I'm generating.
- >If there is a difference, what are the rules?
-
- There is no effect on the ARM2, ARM3 or ARM250.
-
-
- >Keep well,
-
- >Stephen Thomas
-
- --Phil.
-