- Newsgroups: comp.lang.forth
- Path: sparky!uunet!cs.utexas.edu!swrinde!emory!sol.ctr.columbia.edu!venezia!penev
- From: penev@venezia (Penio Penev)
- Subject: Re: Hardware ONLY issues
- References: <4192.UUL1.3#5129@willett.pgh.pa.us>
- Sender: nobody@ctr.columbia.edu
- Organization: Rockefeller University
- Date: Sun, 27 Dec 1992 21:38:21 GMT
- X-Newsreader: TIN [version 1.1 PL6]
- Message-ID: <1992Dec27.213821.29605@sol.ctr.columbia.edu>
- Reply-To: penev@venezia.rockefeller.edu
- X-Posted-From: venezia.rockefeller.edu
- NNTP-Posting-Host: sol.ctr.columbia.edu
- Lines: 141
-
- ForthNet articles from GEnie (ForthNet@willett.pgh.pa.us) wrote:
- : Category 9, Topic 2
- : Message 127 Sat Dec 26, 1992
- : ELLIOTT.C at 13:40 EST
- :
- : -----via CRS Premium Bulletin Board -
- : USR Dual Standard 16.8K (416) 629-7000
- :
- : Date: 12-22-92 (02:09)
- : To: ALL
- : From: MARCEL HENDRIX
- : Subj: THREADING SPEED
- :
- : Penio Penev wrote about M. Anton Ertl's ``Threading speed''
- :
- : M. Anton Ertl's benchmark intrigued me too, but it was in a
- : foreign language (C). However, your Forth version I can
- : reproduce. Here is what I found for the TMS320C30, a 32-bit,
- : 33 MHz DSP chip from Texas Instruments. The output shown is
- : produced from within my interactive target compiler for this
- : chip (the tc is written in iForth, a 32-bit Forth for the '386.
- : It runs under GO32 in protected mode on my PC).
- :
- :
- ..
- : | code cdummy next, end-code
- : | : dummy ;
- : | : dd FOR dummy NEXT ;
- : | : cc FOR cdummy NEXT ;
- : | : tara FOR NEXT ;
- : | cr .( tara : ) timer-reset 10000000 tara .elapsed .( Indigo: 1300 )
- : | cr .( dd : ) timer-reset 10000000 dd .elapsed .( Indigo: 3100 )
- : | cr .( cc : ) timer-reset 10000000 cc .elapsed .( Indigo: 2500 )
- : tara : 3.630 seconds elapsed. Indigo: 1300
- : dd : 10.065 seconds elapsed. Indigo: 3100
- : cc : 10.065 seconds elapsed. Indigo: 2500 ok
- : <TARGET> see cc
- : $000005C7 ldi *ar0++(1),r0 08402001 .... MH> dpop,
- : $000005C8 ldi $5CF,r2 086205CF ....
- : $000005C9 push r2 0F220000 ....
- : $000005CA push r0 0F200000 ....
- : $000005CB push r0 0F200000 ....
- : $000005CC ldi r0,r7 08070000 .... MH> loop count
- : $000005CD addi 1,r2 02620001 ....
- : $000005CE bu r2 68000002 ....
- : $000005CF bu $5D3 6A000003 .... MH> LEAVE use
- : $000005D0 callu $5B7 7200FFE6 ....
- : $000005D1 subi 1,r7 18670001 .... MH> NEXT
- : $000005D2 bne $5D0 6A06FFFD ....
- : $000005D3 subi 3,sp 18740003 ....
- : $000005D4 retsu 78800000 .... ok
- : <TARGET> words
- :
- : 2
- : tara cc dd dummy
- : cdummy ok
- : <TARGET> see cdummy
- : $000005B7 retsu 78800000 .... ok
- : <TARGET> see dummy
- : $000005B8 retsu 78800000 .... ok
- : <TARGET> close-log
- : ---
- ..
- : The long intro to FOR makes work a bit easier for the target
- : compiler, and allows me to code FOR ... LEAVE ... NEXT if I want
- : to. You cannot nest FOR NEXT's.
-
- The definition of FOR NEXT I posted _can_ be nested. There is another
- definition, which cannot. Words can be invoked from inside it, but you
- do not have I in it.
-
- : FOR( ( - a) TS V1 mov Drop V1 dec begin ; IMMEDIATE
- : )NEXT ( a) V1 Z= until V1 dec ; IMMEDIATE
-
- Measurements:
- : TARA() FOR( )NEXT ; ok
- COUNTER 1000000000 TARA() TIMER 62000 ok
-
- In dbx:
- [RETRY, 0x10011ef4] addiu sp,sp,-4
- [RETRY, 0x10011ef8] move v1,s0
- [RETRY, 0x10011efc] lw s0,0(s8)
- [RETRY, 0x10011f00] addiu s8,s8,4
- [RETRY, 0x10011f04] addiu v1,v1,-1
- [RETRY, 0x10011f08] bne v1,zero,0x10011f08
- [RETRY, 0x10011f0c] addiu v1,v1,-1
- *[RETRY, 0x10011f10] addiu sp,sp,4
- [RETRY, 0x10011f14] jr ra
- [RETRY, 0x10011f18] lw ra,4(sp)
-
- : The constant 100,000,000 was changed to 10,000,000 because, as
- : you can see, the code takes about three to four times as long to
- : run as on your Indigo. It is possible to optimize CC above, by
- : using bned instead of bne, but that won't give me 300% more
- : speed(?) I really wonder how the R3000 does it. Branch
- : prediction?
-
- The R3000 can deliver one result per clock if the pipe is kept full,
- and it can make use of the branch delay slot (the instruction after
- the branch). I was (pleasantly) surprised when I realised that the
- branch delay slot is enough to keep the pipe full. Two clocks/loop at
- 33 MHz means 16.5 loops/us = 16.5 Mloops/second. The 1000M loops
- should be performed in 60.6 seconds in theory. The agreement with
- experiment (62 seconds) is very good, given that I have at least 10
- other windows open, one of which is a graphical clock updated every
- second. This means that I can utilise ~ 97.8% of the power of my
- machine without sacrificing any convenience.
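
The estimate above can be checked with a few lines of arithmetic. The following is only a sketch: the function names are made up for illustration, and it assumes that TIMER reports milliseconds and that each pass through the loop costs exactly two clocks (the bne plus the addiu in its delay slot).

```python
# Sketch of the loop-timing estimate; all names here are illustrative.
CLOCK_HZ = 33_000_000        # R3000 clock rate quoted in the post
CLOCKS_PER_LOOP = 2          # assumed: bne plus the addiu in its delay slot


def predicted_seconds(loops, clock_hz=CLOCK_HZ, clocks_per_loop=CLOCKS_PER_LOOP):
    """Ideal run time if the pipe never stalls."""
    return loops * clocks_per_loop / clock_hz


def utilisation(loops, measured_ms, clock_hz=CLOCK_HZ,
                clocks_per_loop=CLOCKS_PER_LOOP):
    """Fraction of the measured time accounted for by the ideal run time."""
    ideal = predicted_seconds(loops, clock_hz, clocks_per_loop)
    return ideal / (measured_ms / 1000.0)


if __name__ == "__main__":
    loops = 1_000_000_000    # argument given to TARA()
    measured_ms = 62_000     # TIMER result, assumed to be in milliseconds
    print(f"predicted:   {predicted_seconds(loops):.1f} s")
    print(f"utilisation: {utilisation(loops, measured_ms):.1%}")
```

Changing CLOCK_HZ or CLOCKS_PER_LOOP shows how sensitive the estimate is to either assumption.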
-
- In my opinion the R3000 has some way of processing two instructions
- in the early stages of the pipe (perhaps the first 2 of 5 total). I
- made the following test:
-
- : TT FOR( [ begin 0 Z= until nop ] )NEXT ; ok
- COUNTER 100000000 TT TIMER 13000 ok
-
- This is one branch which is taken and one which is not. The timing
- is ~ two times the original one, which means that both branches
- execute in the same time. On the other hand, the Reference Manual
- clearly states that the branch decision is made one clock after the
- calculation of the target address and the condition. This is one clock
- after the Instruction Fetch phase of the target instruction.
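
The comparison between the two measurements can be made concrete by converting each TIMER reading into clocks per loop iteration. This is a sketch under the same assumptions as the timings above: a 33 MHz clock and TIMER values in milliseconds. The function name is hypothetical.

```python
# Sketch: convert measured run times into clocks per loop iteration.
CLOCK_HZ = 33_000_000        # R3000 clock rate quoted in the post


def clocks_per_iteration(loops, measured_ms, clock_hz=CLOCK_HZ):
    """Average number of clocks spent on one pass through the loop."""
    return (measured_ms / 1000.0) * clock_hz / loops


# TARA(): empty loop, one taken branch per iteration.
tara = clocks_per_iteration(1_000_000_000, 62_000)
# TT: the same loop plus one branch that is never taken (and a nop).
tt = clocks_per_iteration(100_000_000, 13_000)

if __name__ == "__main__":
    print(f"TARA(): {tara:.2f} clocks/iteration")
    print(f"TT:     {tt:.2f} clocks/iteration")
    print(f"ratio:  {tt / tara:.2f}")
```

With the figures from the post this gives roughly 2 clocks per iteration for TARA() and a little over 4 for TT, which is the "two times" ratio discussed above.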
-
- The R4000 has an 8 stage pipe, and there (IMHO) maintaining two early
- pipes is not possible (feasible). It has 'branch if condition likely'
- instructions.
-
- Another consequence of this feature is that You can make
- call = jump and link, store, dec = 3 clocks
- ret = jump register, load, inc = 3 clocks
- ----------
- 6 clocks
-
- call unnestable = jump and link, store = 2 clocks
- ret from unnestable = jump register, load = 2 clocks
- ----------
- 4 clocks
-
- I like this processor.
-
- -- Penio.
-