NetNews Usenet Archive 1992 #31

Text File | 1992-12-27 | 6.2 KB | 157 lines

Newsgroups: comp.lang.forth
Path: sparky!uunet!cs.utexas.edu!swrinde!emory!sol.ctr.columbia.edu!venezia!penev
From: penev@venezia (Penio Penev)
Subject: Re: Hardware ONLY issues
References: <4192.UUL1.3#5129@willett.pgh.pa.us>
Sender: nobody@ctr.columbia.edu
Organization: Rockefeller University
Date: Sun, 27 Dec 1992 21:38:21 GMT
X-Newsreader: TIN [version 1.1 PL6]
Message-ID: <1992Dec27.213821.29605@sol.ctr.columbia.edu>
Reply-To: penev@venezia.rockefeller.edu
X-Posted-From: venezia.rockefeller.edu
NNTP-Posting-Host: sol.ctr.columbia.edu
Lines: 141

ForthNet articles from GEnie (ForthNet@willett.pgh.pa.us) wrote:
: Category 9, Topic 2
: Message 127 Sat Dec 26, 1992
: ELLIOTT.C at 13:40 EST
:
: -----via CRS Premium Bulletin Board -
: USR Dual Standard 16.8K (416) 629-7000
:
: Date: 12-22-92 (02:09)
: To: ALL
: From: MARCEL HENDRIX
: Subj: THREADING SPEED
:
: Penio Penev wrote about M. Anton Ertl's ``Threading speed''
:
: M. Anton Ertl's benchmark intrigued me too, but it was in a
: foreign language (C). However, your Forth version I can
: reproduce. Here is what
:
: I found for the TMS320C30, a 32-bits, 33 MHz DSP chip from Texas
: Instruments. The output shown is produced from within my
: interactive target compiler for this chip (The tc is written in
: iForth, a 32-bit Forth for the '386. It runs under GO32 in
: protected mode on my PC).
: ..
: | code cdummy next, end-code
: | : dummy ;
: | : dd FOR dummy NEXT ;
: | : cc FOR cdummy NEXT ;
: | : tara FOR NEXT ;
: | cr .( tara : ) timer-reset 10000000 tara .elapsed .( Indigo: 1300 )
: | cr .( dd : ) timer-reset 10000000 dd .elapsed .( Indigo: 3100 )
: | cr .( cc : ) timer-reset 10000000 cc .elapsed .( Indigo: 2500 )
: tara : 3.630 seconds elapsed. Indigo: 1300
: dd : 10.065 seconds elapsed. Indigo: 3100
: cc : 10.065 seconds elapsed. Indigo: 2500 ok
: <TARGET> see cc
: $000005C7  ldi *ar0++(1),r0  08402001  ....  MH> dpop,
: $000005C8  ldi $5CF,r2       086205CF  ....
: $000005C9  push r2           0F220000  ....
: $000005CA  push r0           0F200000  ....
: $000005CB  push r0           0F200000  ....
: $000005CC  ldi r0,r7         08070000  ....  MH> loop count
: $000005CD  addi 1,r2         02620001  ....
: $000005CE  bu r2             68000002  ....
: $000005CF  bu $5D3           6A000003  ....  MH> LEAVE use
: $000005D0  callu $5B7        7200FFE6  ....
: $000005D1  subi 1,r7         18670001  ....  MH> NEXT
: $000005D2  bne $5D0          6A06FFFD  ....
: $000005D3  subi 3,sp         18740003  ....
: $000005D4  retsu             78800000  ....  ok
: <TARGET> words
:
: 2 : tara cc dd dummy
: cdummy ok
: <TARGET> see cdummy
: $000005B7  retsu             78800000  ....  ok
: <TARGET> see dummy
: $000005B8  retsu             78800000  ....  ok
: <TARGET> close-log
: --- ..
: The long intro to FOR makes work a bit easier for the target
: compiler, and allows me to code FOR ... LEAVE ... NEXT if I want
: to. You cannot nest FOR NEXT's.

The definition of FOR NEXT I posted _can_ be nested. There is another
definition, which cannot. Words can be invoked from it, but you do not
have I in it.

    : FOR( ( - a)  TS V1 mov  Drop  V1 dec  begin ; IMMEDIATE
    : )NEXT ( a)  V1 Z= until  V1 dec ; IMMEDIATE

Measurements:

    : TARA() FOR( )NEXT ; ok
    COUNTER 1000000000 TARA() TIMER 62000 ok

In dbx:

     [RETRY, 0x10011ef4]  addiu sp,sp,-4
     [RETRY, 0x10011ef8]  move  v1,s0
     [RETRY, 0x10011efc]  lw    s0,0(s8)
     [RETRY, 0x10011f00]  addiu s8,s8,4
     [RETRY, 0x10011f04]  addiu v1,v1,-1
     [RETRY, 0x10011f08]  bne   v1,zero,0x10011f08
     [RETRY, 0x10011f0c]  addiu v1,v1,-1
    *[RETRY, 0x10011f10]  addiu sp,sp,4
     [RETRY, 0x10011f14]  jr    ra
     [RETRY, 0x10011f18]  lw    ra,4(sp)

: The constant 100,000,000 was changed to 10,000,000 because, as
: you can see, the code takes about three to four times as long to
: run as on your Indigo. It is possible to optimize CC above, by
: using bned instead of bne, but that won't give me 300% more
: speed(?) I really wonder how the R3000 does it. Branch
: prediction?

The R3000 can deliver one result in one clock if the pipe is kept
full. It can utilize the branch delay slot (the instruction after the
branch).
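As a rough illustration of the delay-slot loop in the dbx listing, here is a minimal Python sketch of the two instructions at 0x10011f08 and 0x10011f0c. The function name and the returned tuple are hypothetical. The sketch assumes a loop count of at least 1 and models only the documented MIPS rule that the instruction after a branch always executes, not the pipeline itself.

```python
def run_delay_slot_loop(n):
    """Model the bne/addiu pair from the dbx listing for a loop count n >= 1.

    Returns (iterations, instructions_executed, final_v1).
    """
    if n < 1:
        raise ValueError("this sketch assumes n >= 1")
    v1 = n - 1            # addiu v1,v1,-1 executed once before the loop
    iterations = 0
    executed = 0
    while True:
        taken = v1 != 0   # bne v1,zero,<itself>: condition is read first
        executed += 1
        v1 -= 1           # delay slot: addiu v1,v1,-1 runs whether or not taken
        executed += 1
        iterations += 1
        if not taken:
            break
    return iterations, executed, v1


if __name__ == "__main__":
    print(run_delay_slot_loop(1000))
```

Each pass costs exactly two instructions, the branch and its delay slot, which is where the figure of two clocks per loop comes from if one instruction completes per clock.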
I was rather surprised (pleasantly) when I realised that the branch
delay slot is enough to fill the pipe. Two clocks/loop at 33 MHz means
16.5 loops/us = 16.5 Mloops/second. The 1000M loops should be
performed in 60.6 seconds in theory. The agreement with experiment is
very good, given that I have at least 10 other open windows, one of
which is a graphical clock updated every second. This means that I can
utilise ~ 97.7% of the power of my machine without sacrificing any
convenience.

In my opinion the R3000 has some way of processing two instructions in
the early stages of the pipe (perhaps the first 2 of 5 total). I made
the following test:

    : TT FOR( [ begin 0 Z= until nop ] )NEXT ; ok
    COUNTER 100000000 TT TIMER 13000 ok

This is one branch which is taken and one which is not. The timing
per loop is ~ two times the original one, which means that both
branches execute in the same time. On the other hand, the Reference
Manual clearly states that the branch decision is made one clock after
the calculation of the target address and the condition. This is one
clock after the Instruction Fetch phase of the target instruction.

The R4000 has an 8 stage pipe, and there (IMHO) maintaining two early
pipes is not possible (feasible). It has 'branch if condition likely'
instructions.

Another consequence of this feature is that you can make

    call = jump and link, store, dec = 3 clocks
    ret  = jump register, load, inc  = 3 clocks
                                      ----------
                                        6 clocks

    call unnestable     = jump and link, store = 2 clocks
    ret from unnestable = jump register, load  = 2 clocks
                                                ----------
                                                  4 clocks

I like this processor.

-- Penio.
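The timing arithmetic in the post can be checked with a short Python sketch. It assumes that the TIMER readings (62000 and 13000) are in milliseconds, which the comparison with seconds in the text implies, and it takes the 33 MHz clock as stated. The helper names are hypothetical, and the figures are recomputed from those inputs, so they may differ slightly from rounded numbers quoted in the text.

```python
CLOCK_HZ = 33_000_000  # clock rate stated in the post


def theoretical_seconds(loops, clocks_per_loop, clock_hz=CLOCK_HZ):
    """Run time if every loop costs exactly clocks_per_loop clocks."""
    return loops * clocks_per_loop / clock_hz


def measured_clocks_per_loop(elapsed_ms, loops, clock_hz=CLOCK_HZ):
    """Average clocks per loop implied by a TIMER reading (assumed ms)."""
    return elapsed_ms / 1000 * clock_hz / loops


# TARA(): 1000M empty loops, TIMER 62000
tara_theory = theoretical_seconds(1_000_000_000, 2)
tara_measured = measured_clocks_per_loop(62_000, 1_000_000_000)
tara_utilisation = tara_theory / 62.0

# TT: 100M loops with one taken and one untaken branch, TIMER 13000
tt_measured = measured_clocks_per_loop(13_000, 100_000_000)

if __name__ == "__main__":
    print(f"TARA() theory:      {tara_theory:.1f} s")
    print(f"TARA() measured:    {tara_measured:.2f} clocks/loop")
    print(f"TARA() utilisation: {tara_utilisation:.1%}")
    print(f"TT measured:        {tt_measured:.2f} clocks/loop")
```

Under these assumptions the empty loop comes out just above two clocks per pass and the TT loop just above four, consistent with a taken and an untaken branch costing the same.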