NetNews Usenet Archive 1992 #31

Text File | 1992-12-31 | 6.8 KB | 167 lines

Newsgroups: comp.sys.sgi.misc
Path: sparky!uunet!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!venezia!penev
From: penev@venezia (Penio Penev)
Subject: Re: Where does the CPU go, when its not doing nuthin' (gr_osview)
References: <1992Dec27.213821.29605@sol.ctr.columbia.edu>
Sender: nobody@ctr.columbia.edu
Organization: Rockefeller University
Date: Thu, 31 Dec 1992 12:25:15 GMT
X-Newsreader: TIN [version 1.1 PL6]
Message-ID: <1992Dec31.122515.12993@sol.ctr.columbia.edu>
Reply-To: penev@venezia.rockefeller.edu
X-Posted-From: venezia.rockefeller.edu
NNTP-Posting-Host: sol.ctr.columbia.edu
Lines: 151

Penio Penev (penev@venezia) wrote:
: ForthNet articles from GEnie (ForthNet@willett.pgh.pa.us) wrote:
: : Category 9, Topic 2
: : Message 127 Sat Dec 26, 1992
: : ELLIOTT.C at 13:40 EST
: :
: : -----via CRS Premium Bulletin Board -
: : USR Dual Standard 16.8K (416) 629-7000
: :
: : Date: 12-22-92 (02:09)
: : To: ALL
: : From: MARCEL HENDRIX
: : Subj: THREADING SPEED
: :
: : Penio Penev wrote about M. Anton Ertl's ``Threading speed''
: :
: : M. Anton Ertl's benchmark intrigued me too, but it was in a
: : foreign language (C). However, your Forth version I can
: : reproduce. Here is what I found for the TMS320C30, a 32-bit,
: : 33 MHz DSP chip from Texas Instruments. The output shown is
: : produced from within my interactive target compiler for this
: : chip (The tc is written in iForth, a 32-bit Forth for the '386.
: : It runs under GO32 in protected mode on my PC).
: :
: ..
: : | code cdummy next, end-code
: : | : dummy ;
: : | : dd FOR dummy NEXT ;
: : | : cc FOR cdummy NEXT ;
: : | : tara FOR NEXT ;
: : | cr .( tara : ) timer-reset 10000000 tara .elapsed .( Indigo: 1300 )
: : | cr .( dd : ) timer-reset 10000000 dd .elapsed .( Indigo: 3100 )
: : | cr .( cc : ) timer-reset 10000000 cc .elapsed .( Indigo: 2500 )
: : tara : 3.630 seconds elapsed. Indigo: 1300
: : dd : 10.065 seconds elapsed. Indigo: 3100
: : cc : 10.065 seconds elapsed. Indigo: 2500 ok
: : <TARGET> see cc
: : $000005C7 ldi *ar0++(1),r0   08402001 .... MH> dpop,
: : $000005C8 ldi $5CF,r2        086205CF ....
: : $000005C9 push r2            0F220000 ....
: : $000005CA push r0            0F200000 ....
: : $000005CB push r0            0F200000 ....
: : $000005CC ldi r0,r7          08070000 .... MH> loop count
: : $000005CD addi 1,r2          02620001 ....
: : $000005CE bu r2              68000002 ....
: : $000005CF bu $5D3            6A000003 .... MH> LEAVE use
: : $000005D0 callu $5B7         7200FFE6 ....
: : $000005D1 subi 1,r7          18670001 .... MH> NEXT
: : $000005D2 bne $5D0           6A06FFFD ....
: : $000005D3 subi 3,sp          18740003 ....
: : $000005D4 retsu              78800000 .... ok
: : <TARGET> words
: :
: : 2
: : tara cc dd dummy
: : cdummy ok
: : <TARGET> see cdummy
: : $000005B7 retsu              78800000 .... ok
: : <TARGET> see dummy
: : $000005B8 retsu              78800000 .... ok
: : <TARGET> close-log
: : ---
: ..
: : The long intro to FOR makes work a bit easier for the target
: : compiler, and allows me to code FOR ... LEAVE ... NEXT if I want
: : to. You cannot nest FOR NEXT's.
:
: The definition of FOR NEXT I posted _can_ be nested. There is another
: definition, which cannot. Words can be invoked from it, but You do not
: have I in it.
:
: 1 : FOR( ( - a) TS V1 mov Drop V1 dec begin ; IMMEDIATE
: 2 : )NEXT ( a) V1 Z= until V1 dec ; IMMEDIATE
:
: Measurements:
: : TARA() FOR( )NEXT ; ok
: COUNTER 1000000000 TARA() TIMER 62000 ok
:
: In dbx:
: [RETRY, 0x10011ef4] addiu sp,sp,-4
: [RETRY, 0x10011ef8] move v1,s0
: [RETRY, 0x10011efc] lw s0,0(s8)
: [RETRY, 0x10011f00] addiu s8,s8,4
: [RETRY, 0x10011f04] addiu v1,v1,-1
: [RETRY, 0x10011f08] bne v1,zero,0x10011f08
: [RETRY, 0x10011f0c] addiu v1,v1,-1
: *[RETRY, 0x10011f10] addiu sp,sp,4
: [RETRY, 0x10011f14] jr ra
: [RETRY, 0x10011f18] lw ra,4(sp)
:
: : The constant 100,000,000 was changed to 10,000,000 because, as
: : you can see, the code takes about three to four times as long to
: : run as on your Indigo. It is possible to optimize CC above, by
: : using bned instead of bne, but that won't give me 300% more
: : speed(?) I really wonder how the R3000 does it. Branch
: : prediction?
:
: The R3000 can deliver one result in one clock if the pipe is kept
: full. It can utilize the branch delay slot (the instruction after the
: branch). I was rather surprised (pleasantly), when I realised, that
: the branch delay slot is enough to fill the pipe. Two clocks/loop at
: 33 MHz means 17.5 loops/us = 17.5 Mloops/second. The 1000M loops
: should be performed in 57.1 seconds in theory. The agreement with
: experiment is very good, given that I have at least 10 other open
: windows, one of which is a graphical clock, updated once a second.
: This means, that I can utilise ~ 92.1% of the power of my machine
: without sacrificing any convenience.

What activities are carried out on an unburdened system? 8% of this
powerful CPU seems a little bit much for the clock. The other windows
just waited for input from the keyboard. When I checked this with
gr_osview, the CPU was almost all user time, with a very small amount
of sys.

:
: In my opinion the R3000 has some way of processing two instructions
: in the early stages of the pipe (perhaps the first 2 of 5 total). I
: made the following test:
:
: : TT FOR( [ begin 0 Z= until nop ] )NEXT ; ok
: COUNTER 100000000 TT TIMER 13000 ok
:
: This is one branch, which is taken, and one which is not. The timing
: is ~ two times the original one, which means, that both branches
: execute in the same time. On the other hand, the Reference Manual
: clearly states, that the branch decision is made one clock after the
: calculation of the target address and the condition. This is one clock
: after the Instruction Fetch phase on the target instruction.
:
: The R4000 has an 8 stage pipe, and there (IMHO) maintaining two early
: pipes is not possible (feasible). It has 'branch if condition likely'
: instructions.
:
: Another consequence of this feature is that You can make
:
:   call                = jump and link, store, dec = 3 clocks
:   ret                 = jump register, load, inc  = 3 clocks
:                                                     ----------
:                                                     6 clocks
:
:   call unnestable     = jump and link, store      = 2 clocks
:   ret from unnestable = jump register, load       = 2 clocks
:                                                     ----------
:                                                     4 clocks
:
: I like this processor.
:
: -- Penio.

-- Penio.
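
For anyone who wants to re-check the loop-timing arithmetic quoted above, here is a rough sketch in Python (not Forth, and not anything from the original benchmark). It assumes that TIMER prints elapsed milliseconds, which the post does not state but which fits the 62000 reading against the ~57 s estimate. The function names are made up for this illustration. Under the two-clocks-per-loop model, a 33 MHz clock gives 16.5 Mloops/s, while the quoted 17.5 Mloops/s figure corresponds to a 35 MHz clock, so the sketch evaluates both.

```python
# Sketch only: checks the loop-timing arithmetic quoted in the post.
# Assumption (not stated in the post): TIMER prints elapsed milliseconds.
# Function names are hypothetical, chosen for this illustration.

def loop_estimate(clock_mhz, clocks_per_loop, loops, measured_ms):
    """Return (Mloops/s, ideal seconds, ideal time / measured time)."""
    rate_mloops = clock_mhz / clocks_per_loop
    ideal_s = loops / (rate_mloops * 1e6)
    ratio = ideal_s / (measured_ms / 1000.0)
    return rate_mloops, ideal_s, ratio


def ns_per_loop(loops, measured_ms):
    """Measured time per loop iteration, in nanoseconds."""
    return measured_ms * 1e6 / loops


if __name__ == "__main__":
    # TARA(): 1000M empty loops, TIMER reading 62000.
    for mhz in (33.0, 35.0):
        rate, ideal, ratio = loop_estimate(mhz, 2, 1_000_000_000, 62000)
        print(f"{mhz:.0f} MHz: {rate:.1f} Mloops/s, "
              f"{ideal:.1f} s ideal, {ratio:.1%} of measured time")

    # TT: 100M loops with one extra (not taken) branch, TIMER reading 13000.
    tara = ns_per_loop(1_000_000_000, 62000)
    tt = ns_per_loop(100_000_000, 13000)
    print(f"TARA() {tara:.0f} ns/loop, TT {tt:.0f} ns/loop, "
          f"ratio {tt / tara:.2f}")
```

If the clock really is 33 MHz, the same model leaves only a couple of percent unaccounted for rather than 8%, which would make the clock and idle windows a more plausible explanation. The TT/TARA() ratio comes out close to two, as the quoted text says.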