- Path: sparky!uunet!crdgw1!rpi!usc!howland.reston.ans.net!spool.mu.edu!yale.edu!newsserver.jvnc.net!gmd.de!Germany.EU.net!mcsun!dxcern!dscomsa!vxdesy.desy.de!burke
- From: burke@vxdesy.desy.de (Stephen Burke)
- Newsgroups: comp.sys.acorn.tech
- Subject: ARM code optimisation
- Message-ID: <1993Jan27.194033.1@vxdesy.desy.de>
- Date: 27 Jan 93 19:40:33 GMT
- Sender: usenet@dscomsa.desy.de (usenet)
- Organization: (DESY, Hamburg, Germany)
- Lines: 53
-
-
- I've been reading the ARM data manual and trying to work out what optimisations
- can be made. I've come up with the following; if anyone knows better they'll no
- doubt say so :-)
-
- 1) The MEMC can access memory in either an s-cycle (sequential) or an n-cycle
- (non-sequential). The first takes one clock cycle (to RAM at least), and the
- second takes two clock cycles. You can only have an s-cycle if it's an access
- to an address immediately following the one previously accessed (this isn't
- strictly correct, but it's the way it works out). DMA requests (video update
- etc.) are held up during s-cycles, so the MEMC forces any access on a quadword
- boundary (i.e. byte address divisible by 16, last hex digit zero) to be an
- n-cycle regardless. The basic optimisation is therefore to arrange to have
- accesses which couldn't be sequential anyway occur on a quadword boundary.
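As a sanity check on the rule above, here is a toy Python model of the timing (my own sketch: the 1-clock s-cycle and 2-clock n-cycle figures are from the post, and every access on a quadword boundary is treated as a forced n-cycle):

```python
# Toy model of MEMC word-access timing as described above.
# Assumptions: s-cycle = 1 clock, n-cycle = 2 clocks (RAM), and any
# access whose byte address is quadword-aligned is forced to an n-cycle.

def access_cycles(addresses):
    """Total clocks for a sequence of word accesses (byte addresses)."""
    total = 0
    prev = None
    for addr in addresses:
        sequential = prev is not None and addr == prev + 4
        quadword = addr % 16 == 0
        if sequential and not quadword:
            total += 1   # s-cycle
        else:
            total += 2   # n-cycle (first access, non-sequential, or forced)
        prev = addr
    return total

# Eight sequential words starting ON a quadword boundary:
# n,s,s,s, n,s,s,s = 10 clocks
print(access_cycles(range(0x8000, 0x8000 + 32, 4)))  # 10

# The same eight words starting one word PAST the boundary:
# n,s,s, n,s,s,s, n = 11 clocks
print(access_cycles(range(0x8004, 0x8004 + 32, 4)))  # 11
```

The misaligned run pays an extra clock because it straddles one more quadword boundary, which is the effect the optimisation avoids.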
-
- 2) For an ARM 2, as far as I can see this basically means that the target
- of a branch, the instruction following a store (but not a load), and the
- first address accessed by an STM or LDM should always be on a quadword
- boundary. You also get maximum speed/word if code segments are a multiple of
- 4 words long, and all LDM/STM instructions transfer a multiple of four
- registers. All instructions start with an opcode prefetch, so I think you
- get no gain from having loads or stores consecutively.
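A hypothetical helper for the alignment rule (mine, not from the post): given where the next instruction would fall, it works out how many padding bytes, e.g. as NOP words, put a branch target or LDM/STM base on a quadword boundary.

```python
# Hypothetical alignment helper: bytes of padding needed so the next
# instruction or data word lands on a quadword (16-byte) boundary.

def pad_to_quadword(byte_addr):
    """Padding bytes before byte_addr is quadword-aligned."""
    return (-byte_addr) % 16

print(pad_to_quadword(0x8004))  # 12 (three NOP words)
print(pad_to_quadword(0x8010))  # 0  (already aligned)
```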
-
- 3) An ARM3 with the cache off behaves like an ARM2. The same is true for
- stores even with the cache on, since all writes go "through" the cache
- (they're also written into the cache if appropriate). If a read or an opcode
- fetch finds its target in the cache then
- it's read in one (25 MHz or whatever) cycle, regardless of whether it's an n or
- s access, or what the address is. If it isn't in the cache, it reads one cache
- line, which is four words starting on a quadword boundary. The cpu is halted
- until the word it wants has been read. Thus you gain twice by starting code
- segments on quadword boundaries; you don't waste time reading words you don't
- want, and the cpu starts again as soon as the first word is read. You also get
- better cache usage efficiency if code and data chunks are a multiple of four
- words long.
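The line-fill stall can be sketched numerically (my own toy model, reusing the post's assumption that the four words of a line are read first-word-first, with the first read an n-cycle and the rest s-cycles):

```python
# Toy model of an ARM3 cache-line fill: four words are read starting at
# the quadword boundary (n-cycle then three s-cycles), and the CPU is
# stalled until the word it actually asked for has been read.

def fill_stall(word_offset):
    """Clocks the CPU waits, given the wanted word's offset (0-3) in the line."""
    return 2 + word_offset  # 2-clock n-cycle for word 0, +1 s-cycle per word

print([fill_stall(k) for k in range(4)])  # [2, 3, 4, 5]
```

So under these assumptions, code that starts at offset 0 of its line restarts the cpu up to three clocks sooner than code starting at offset 3.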
-
- This last effect is fairly small. The cache is organised as four blocks of 64
- 4-word lines. Thus if you read 16 sequential words they're guaranteed to go
- into different locations in the cache. When you read the next word it goes
- randomly into one of the 64 lines in that block, so you have a 1/64 chance that
- it removes the first 4 words you read. This is a low probability, and after a
- few iterations you can usually assume that the whole loop is in the cache
- (unless it's nearly as big as the cache). However, it does presumably mean that
- there's a small advantage to having chunks be a multiple of 16 words.
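The 1/64 figure compounds in the obvious way; a minimal sketch (assuming, as the post does, that each new fill evicts a line chosen uniformly at random from the 64 lines of its block):

```python
# Random-replacement survival model: each fill into a 64-line block
# evicts one resident line chosen uniformly at random.

def survival_probability(extra_fills, lines_per_block=64):
    """P(a resident line is still cached after extra_fills random fills)."""
    return (1 - 1 / lines_per_block) ** extra_fills

print(round(survival_probability(1), 4))   # 0.9844, i.e. the 1/64 chance above
print(round(survival_probability(64), 4))  # after 64 fills the odds are ~37%
```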
-
- As far as I can see, this covers all the possible optimisation issues above
- the basic times for the instructions. Does anyone have anything more?
-
- e----><----p | Stephen Burke | Internet: burke@vxdesy.desy.de
- H H 1 | Gruppe FH1T (Liverpool) | DECnet: vxdesy::burke (13313::burke)
- H H 11 | DESY, Notkestrasse 85 | BITNET: BURKE@DESYVAX or SB2@UKACRL
- HHHHH 1 | 2000 Hamburg 52 | JANET: sb2@uk.ac.rl.ib
- H H 1 | Germany | Phone: + 49 40 8998 2282
- H H 11111 | HERA, the world's largest electron microscope!
-