- Path: sparky!uunet!crdgw1!rpi!usc!howland.reston.ans.net!spool.mu.edu!yale.edu!newsserver.jvnc.net!gmd.de!Germany.EU.net!mcsun!dxcern!dscomsa!vxdesy.desy.de!burke
- From: burke@vxdesy.desy.de (Stephen Burke)
- Newsgroups: comp.sys.acorn.tech
- Subject: ARM code optimisation
- Message-ID: <1993Jan27.194033.1@vxdesy.desy.de>
- Date: 27 Jan 93 19:40:33 GMT
- Sender: usenet@dscomsa.desy.de (usenet)
- Organization: (DESY, Hamburg, Germany)
- Lines: 53
-
-
- I've been reading the ARM data manual and trying to work out what optimisations
- can be made. I've come up with the following; if anyone knows better they'll no
- doubt say so :-)
-
- 1) The MEMC can access memory in either an s-cycle (sequential) or an n-cycle
- (non-sequential). The first takes one clock cycle (to RAM at least), and the
- second takes two clock cycles. You can only have an s-cycle if it's an access
- to an address immediately following the one previously accessed (this isn't
- strictly correct, but it's the way it works out). DMA requests (video update
- etc.) are held up during s-cycles, so the MEMC forces any access on a quadword
- boundary (i.e. byte address divisible by 16, last hex digit zero) to be an
- n-cycle regardless. The basic optimisation is therefore to arrange to have
- accesses which couldn't be sequential anyway occur on a quadword boundary.
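As a sanity check on the rule above, here is a toy Python model of the timing (my own sketch: the 1-clock s-cycle and 2-clock n-cycle figures are from the post, and every access on a quadword boundary is treated as a forced n-cycle):

```python
# Toy model of MEMC word-access timing as described above.
# Assumptions: s-cycle = 1 clock, n-cycle = 2 clocks (RAM), and any
# access whose byte address is quadword-aligned is forced to an n-cycle.

def access_cycles(addresses):
    """Total clocks for a sequence of word accesses (byte addresses)."""
    total = 0
    prev = None
    for addr in addresses:
        sequential = prev is not None and addr == prev + 4
        quadword = addr % 16 == 0
        if sequential and not quadword:
            total += 1   # s-cycle
        else:
            total += 2   # n-cycle (first access, non-sequential, or forced)
        prev = addr
    return total

# Eight sequential words starting ON a quadword boundary:
# n,s,s,s, n,s,s,s = 10 clocks
print(access_cycles(range(0x8000, 0x8000 + 32, 4)))  # 10

# The same eight words starting one word PAST the boundary:
# n,s,s, n,s,s,s, n = 11 clocks
print(access_cycles(range(0x8004, 0x8004 + 32, 4)))  # 11
```

The misaligned run pays an extra clock because it straddles one more quadword boundary, which is the effect the optimisation avoids.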
-
- 2) For an ARM 2, as far as I can see this basically means that the target
- of a branch, the instruction following a store (but not a load), and the
- first address accessed by an STM or LDM should always be on a quadword
- boundary. You also get maximum speed/word if code segments are a multiple of
- 4 words long, and all LDM/STM instructions transfer a multiple of four
- registers. All instructions start with an opcode prefetch, so I think you
- get no gain from having loads or stores consecutively.
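A hypothetical helper for the alignment rule (mine, not from the post): given where the next instruction would fall, it works out how many padding bytes, e.g. as NOP words, put a branch target or LDM/STM base on a quadword boundary.

```python
# Hypothetical alignment helper: bytes of padding needed so the next
# instruction or data word lands on a quadword (16-byte) boundary.

def pad_to_quadword(byte_addr):
    """Padding bytes before byte_addr is quadword-aligned."""
    return (-byte_addr) % 16

print(pad_to_quadword(0x8004))  # 12 (three NOP words)
print(pad_to_quadword(0x8010))  # 0  (already aligned)
```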
-
- 3) An ARM3 with the cache off behaves like an ARM2. The same is true for
- stores even with the cache on, since all writes go "through" the cache
- (they're also written into the cache if appropriate). If a read or an opcode
- fetch finds its target in the cache then
- it's read in one (25 MHz or whatever) cycle, regardless of whether it's an n or
- s access, or what the address is. If it isn't in the cache, it reads one cache
- line, which is four words starting on a quadword boundary. The cpu is halted
- until the word it wants has been read. Thus you gain twice by starting code
- segments on quadword boundaries; you don't waste time reading words you don't
- want, and the cpu starts again as soon as the first word is read. You also get
- better cache usage efficiency if code and data chunks are a multiple of four
- words long.
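The line-fill stall can be sketched numerically (my own toy model, reusing the post's assumption that the four words of a line are read first-word-first, with the first read an n-cycle and the rest s-cycles):

```python
# Toy model of an ARM3 cache-line fill: four words are read starting at
# the quadword boundary (n-cycle then three s-cycles), and the CPU is
# stalled until the word it actually asked for has been read.

def fill_stall(word_offset):
    """Clocks the CPU waits, given the wanted word's offset (0-3) in the line."""
    return 2 + word_offset  # 2-clock n-cycle for word 0, +1 s-cycle per word

print([fill_stall(k) for k in range(4)])  # [2, 3, 4, 5]
```

So under these assumptions, code that starts at offset 0 of its line restarts the cpu up to three clocks sooner than code starting at offset 3.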
-
- This last effect is fairly small. The cache is organised as four blocks of 64
- 4-word lines. Thus if you read 16 sequential words they're guaranteed to go
- into different locations in the cache. When you read the next word it goes
- randomly into one of the 64 lines in that block, so you have a 1/64 chance that
- it removes the first 4 words you read. This is a low probability, and after a
- few iterations you can usually assume that the whole loop is in the cache
- (unless it's nearly as big as the cache). However, it does presumably mean that
- there's a small advantage to having chunks be a multiple of 16 words.
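The 1/64 figure compounds in the obvious way; a minimal sketch (assuming, as the post does, that each new fill evicts a line chosen uniformly at random from the 64 lines of its block):

```python
# Random-replacement survival model: each fill into a 64-line block
# evicts one resident line chosen uniformly at random.

def survival_probability(extra_fills, lines_per_block=64):
    """P(a resident line is still cached after extra_fills random fills)."""
    return (1 - 1 / lines_per_block) ** extra_fills

print(round(survival_probability(1), 4))   # 0.9844, i.e. the 1/64 chance above
print(round(survival_probability(64), 4))  # after 64 fills the odds are ~37%
```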
-
- As far as I can see, this covers all the possible optimisation issues above
- the basic times for the instructions. Does anyone have anything more?
-
- e----><----p | Stephen Burke | Internet: burke@vxdesy.desy.de
- H H 1 | Gruppe FH1T (Liverpool) | DECnet: vxdesy::burke (13313::burke)
- H H 11 | DESY, Notkestrasse 85 | BITNET: BURKE@DESYVAX or SB2@UKACRL
- HHHHH 1 | 2000 Hamburg 52 | JANET: sb2@uk.ac.rl.ib
- H H 1 | Germany | Phone: + 49 40 8998 2282
- H H 11111 | HERA, the world's largest electron microscope!
-