- Path: sparky!uunet!pageworks.com!world!eff!sol.ctr.columbia.edu!spool.mu.edu!uwm.edu!ogicse!cs.uoregon.edu!mystix.cs.uoregon.edu!mkelly
- From: mkelly@mystix.cs.uoregon.edu (Michael A. Kelly)
- Newsgroups: comp.sys.mac.programmer
- Subject: Re: Help! making an assembly routine faster
- Message-ID: <1992Nov18.010815.6649@cs.uoregon.edu>
- Date: 18 Nov 92 01:08:15 GMT
- Article-I.D.: cs.1992Nov18.010815.6649
- References: <1992Nov14.200831.20477@nntp.hut.fi> <1992Nov16.014850.28678@cs.uoregon.edu> <1992Nov16.190947.9920@nntp.hut.fi>
- Sender: news@cs.uoregon.edu (Netnews Owner)
- Organization: High Risk Ventures
- Lines: 597
-
- In article <1992Nov16.190947.9920@nntp.hut.fi> jmunkki@vipunen.hut.fi (Juri Munkki) writes:
- >In article <1992Nov16.014850.28678@cs.uoregon.edu> mkelly@mystix.cs.uoregon.edu (Michael A. Kelly) writes:
- >> CMPI.B #0xFF, D2
- >> BNE @hardway ; don't copy everything
- >
- >An addq.w #1, and then a beq might prove to be faster than the cmp with
- >an immediate value. You have to adjust the mask back to its old value,
- >if the test fails, but this can be done either with the jump tables
- >(not with the ones you are using now, but the longer ones I will suggest
- >later in this article) or by a subq.w #1
-
- According to the Motorola manual, you're right.  But in practice it slowed
- things down quite a bit, and I can't figure out why.  I replaced the CMPI
- with an ADDQ #1, then at @hardway I did a SUBQ #1.  My test case is a 32x32
- rect with a 32x32 filled circle as the mask; I'd expect the change to hurt
- even more with more complicated masks.  I still don't see why it's slower:
- the CMPI takes 8 clock cycles and the ADDQ and SUBQ each take 4, so it
- really should come out ahead....  Then again, those timings are for the
- 68000 (and the 68020 too, I think), and I'm using a 68040.
-
- >> @hardway:
- >> ANDI.L #0xF0, D2 ; mask off the low four bits
- >> LSR.W #4, D2 ; shift bits 4-7 into bits 0-3
- >
- >The AND is totally wasted. The LSR will do the masking for you.
-
- :) I don't know *what* I was thinking....
-
- At this point, I ran my test again with the above modifications. They
- improved the speed by about 10%. (Changing the CMPI above decreased
- performance by about 30% with these other changes also in place.)
-
- >You can also eliminate the and and lsr, if you use two 256-entry jump
- >tables that simply ignore the high or low 4 bits. The tables will take
- >some memory (2 x 4 x 256 bytes), but they are easy to construct with
- >copy and paste.
-
- You mean two 16-entry jump tables, right?  I didn't implement that, but
- instead made a separate CopyMask function that uses a single 256-entry
- jump table, with one subroutine for each of the 256 possible mask bytes.
- See the code fragment below.
-
- Hey, maybe I could just save 256 expanded masks (and their complements),
- AND one with the source bytes and the other with the destination bytes,
- then OR the two results together to get the output pixels.  Hmm, I wonder
- if that would be even faster....
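-
- (Something like that in C -- just a sketch of the idea, not code from the
- routine.  BuildExpandTable and CopyEightMasked are made-up names, it assumes
- 8-bit pixels with bit 7 of each mask byte being the leftmost pixel, and it
- ignores alignment.  Whether it would actually beat the jump table is exactly
- the question.)
-
-     static unsigned long expand[256][2];  // mask byte -> 8 bytes of 0xFF/0x00
-
-     static void BuildExpandTable( void )
-     {
-         int byte, bit;
-
-         for ( byte = 0; byte < 256; byte++ )
-             for ( bit = 0; bit < 8; bit++ )
-                 if ( byte & (0x80 >> bit) )
-                     ((unsigned char *) expand[byte])[bit] = 0xFF;
-     }
-
-     // Copy eight 8-bit pixels under one mask byte:
-     // dst = (src AND mask) OR (dst AND NOT mask), one long at a time.
-     static void CopyEightMasked( unsigned char *src, unsigned char *dst,
-                                  unsigned char maskByte )
-     {
-         unsigned long *s = (unsigned long *) src;
-         unsigned long *d = (unsigned long *) dst;
-         unsigned long m0 = expand[maskByte][0];
-         unsigned long m1 = expand[maskByte][1];
-
-         d[0] = (s[0] & m0) | (d[0] & ~m0);
-         d[1] = (s[1] & m1) | (d[1] & ~m1);
-     }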
-
- >> ADD.W D2, D2 ; double the index
- >> ADD.W @table(D2.W), D2 ; calculate the address
- >> JSR @table(D2.W) ; plot four pixels
- >
- >The 68020 has addressing modes that do the multiplication of the index.
- >I haven't needed them myself, but I'm fairly certain that you can improve
- >this part with the right addressing mode.
-
- Nope, I don't think so. You're talking about Address Register Indirect with
- Offset and Index, like so: @table( <no address in this case>, D2.W*2 ).
- The problem is that the value of D2 is preserved in that operation, so instead
- of D2 = (D2 * 2) + @table + (D2 * 2), you get D2 = D2 + @table + (D2 * 2).
-
- >Replace the jsr with a LEA An to the return address and a JMP to the
- >subroutine. Then jump back with a JMP (An). This is quite a bit faster
- >than a JSR/RTS combination, although it's not "good style".
-
- Wow, that made a big difference! About a 17% improvement, making the total
- speedup about 25%.
-
- >> CLR.L D2 ; clear the mask register
- >> MOVE.B (A2)+, D2 ; copy the next mask byte
- >> ANDI.B #0xF, D2 ; mask off the high four bits
- >
- >Use BFEXTU, if you must read the mask again. Remember that you can use
- >-1(A2), if you already incremented A2 or you might be able to account
- >for this with the bitfield offset. You can also use constant bitfield
- >offsets, if I remember correctly. I think you have some registers that
- >you could use, so you could store fairly constant bitfield indices
- >there.
-
- I'm not sure what you mean by constant offsets. I did this:
- BFEXTU -1(A2){4:4}, D2
- and it slowed it down by about 4%.
-
- >> @sub8: ; mask = 1000
- >> MOVE.B (A0)+, (A1)+
- >> ADDQ.L #3, A0
- >> ADDQ.L #3, A1
- >> RTS
- >
- >A move.b (a0),(a1) along with addq #4 is faster on a 68000, but I
- >don't think it matters on new processors. I may be wrong, but you'll
- >probably never see the difference.
-
- You're right, it didn't make any difference at all on my '040.
-
- >In the deep mask version, you could unroll the loop. It's kind of
- >surprising that the 1 bit mask is actually faster, but it's mostly
- >because of the superior algorithm that allows you to directly copy
- >8 bytes at a time in the most common case.
-
- I tossed that code. I don't really think unrolling the loop will get it
- down to my current speed, which is more than twice as fast.
-
- >it might be as much as 10%. The only way to go beyond this is to make
- >the move.l commands aligned on long word destinations, as I mentioned
- >in my previous article.
-
- But as long as I align the source and destination Pixmaps, that isn't an
- issue, right?
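-
- (For what it's worth, here's a hypothetical little check -- not part of the
- routine -- of what "aligned" has to mean for the MOVE.L fast path: the first
- row's source and destination addresses and both rowBytes values all have to
- be multiples of four.  Boolean is the usual <Types.h> type.)
-
-     // Hypothetical helper, not in the posted code: true if every MOVE.L in
-     // the 0xFF fast path hits a long-word-aligned source and destination.
-     static Boolean RowsLongAligned( char *srcRow, char *dstRow,
-                                     long srcRowBytes, long dstRowBytes )
-     {
-         return ( ( (long) srcRow | (long) dstRow |
-                    srcRowBytes | dstRowBytes ) & 3 ) == 0;
-     }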
-
- >I hope my articles offer proof for the other half of my .signature... :-)
-
- Definitely :)
-
-
- OK, here's the new code.  The first one is the newer, better version of
- Quick8CopyMask, with most of the optimizations suggested by Juri.  It's
- about 5.5 times as fast as QuickDraw's CopyMask, at least with my simple
- circle-mask test case.  The second one is a small part of a very large
- Quick8CopyMask that has one subroutine for each of the 256 possible mask
- bytes, rather than only 16 subroutines keyed on mask nibbles (a nibble is
- half a byte, right?).  It's far too long to post here, but if you want a
- copy I'll be happy to email it to you.  It's about 6.5 times as fast as
- CopyMask, or about 15% faster than the short version.
-
- I tested the routines with the mask used in the CalcCMask DTS snippet;
- the short version was 5.7 times as fast as CopyMask and the long version
- was 7 times as fast.
-
- And once again, if anyone can improve on these routines, please tell me how!
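-
- (Before the asm, here's a rough, slow C rendering of what the short version
- does, just as a readable reference -- it is not the posted routine.  It skips
- the 32-bit mode switch and the nibble jump table, and it assumes the mask is
- packed 1 bit per pixel with no per-row padding, bit 7 = leftmost pixel.  The
- name CopyMaskSketch is made up.)
-
-     static void CopyMaskSketch( char *src, char *dst, unsigned char *mask,
-                                 long srcNewline, long dstNewline,
-                                 short width, short height )
-     {
-         short row, group, bit, extra;
-         unsigned char m;
-
-         for ( row = 0; row < height; row++ ) {
-
-             for ( group = 0; group < (width >> 3); group++ ) {
-                 m = *mask++;
-                 if ( m == 0x00 ) {                  // fast path: copy nothing
-                     src += 8;  dst += 8;
-                 } else if ( m == 0xFF ) {           // fast path: copy all eight
-                     *(long *) dst = *(long *) src;  src += 4;  dst += 4;
-                     *(long *) dst = *(long *) src;  src += 4;  dst += 4;
-                 } else {                            // the asm dispatches through
-                     for ( bit = 7; bit >= 0; bit-- ) {   // the jump table here
-                         if ( m & (1 << bit) )
-                             *dst = *src;
-                         src++;  dst++;
-                     }
-                 }
-             }
-
-             extra = width & 0x07;                   // leftover pixels in the row
-             if ( extra ) {
-                 m = *mask++;
-                 for ( bit = 7; bit > 7 - extra; bit-- ) {
-                     if ( m & (1 << bit) )
-                         *dst = *src;
-                     src++;  dst++;
-                 }
-             }
-
-             src += srcNewline;
-             dst += dstNewline;
-         }
-     }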
-
-
- void Quick8CopyMask(
- PixMapHandle srcMap,
- PixMapHandle dstMap,
- Ptr mask,
- Point srcPt,
- Point dstPt,
- short width,
- short height )
- {
-
- register char *src;
- register char *dst;
- register long srcNewline;
- register long dstNewline;
- char mode32 = QD32COMPATIBLE;
- short w = (width >> 3) - 1;
- short e = (width & 0x07) - 1;
- short h = height - 1;
-
- // Set up pointers to the beginning of the memory to copy
- // and calculate the newline value for the source and destination
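- // (the "& 0x3fff" below strips the flag bits QuickDraw keeps in the top of rowBytes)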
-
- src = GetPixBaseAddr( srcMap ) + (long) ((*srcMap)->rowBytes & 0x3fff) * srcPt.v + srcPt.h;
- srcNewline = ((*srcMap)->rowBytes & 0x3fff) - width;
-
- dst = GetPixBaseAddr( dstMap ) + (long) ((*dstMap)->rowBytes & 0x3fff) * dstPt.v + dstPt.h;
- dstNewline = ((*dstMap)->rowBytes & 0x3fff) - width;
-
- // Switch into 32 bit addressing mode
-
- SwapMMUMode( &mode32 );
-
- // Copy the rect from the source to the destination
-
- asm {
-
- MOVE.W h, D0 ; put height loop variable in D0
- MOVEA.L src, A0 ; put the source pixmap address in A0
- MOVEA.L dst, A1 ; put the destination address in A1
- MOVEA.L mask, A2 ; put the mask address in A2
- CLR.L D2 ; clear the mask register
-
- @1: ; copy the next row
- MOVE.W w, D1
-
- @2: ; copy the next eight bytes in the row
-
- MOVE.B (A2)+, D2 ; copy the next mask byte
- BEQ @nocopy ; if zero, don't copy anything
-
- CMPI.B #0xFF, D2
- BNE @hardway ; don't copy everything
-
- MOVE.L (A0)+, (A1)+ ; copy all bytes
- MOVE.L (A0)+, (A1)+
-
- DBF D1, @2
- JMP @endloop
-
- @nocopy: ; copy no bytes
- ADDQ.L #8, A0
- ADDQ.L #8, A1
-
- DBF D1, @2
- JMP @endloop
-
- @hardway:
- LSR.W #4, D2 ; shift bits 4-7 into bits 0-3
- ADD.W D2, D2 ; double the index
- ADD.W @table(D2.W), D2 ; calculate the address
- LEA @rts1, A3 ; save the return address
- JMP @table(D2.W) ; plot four pixels
- @rts1:
-
- MOVE.B -1(A2), D2 ; copy the next mask byte
- ANDI.B #0xF, D2 ; mask off the high four bits
- ADD.W D2, D2 ; double the index
- ADD.W @table(D2.W), D2 ; calculate the address
- LEA @rts2, A3 ; save the return address
- JMP @table(D2.W) ; plot four pixels
- @rts2:
-
- DBF D1, @2
-
- @endloop:
-
- TST.W e
- BLT @4 ; skip the leftover bytes if e is less than 0
-
- MOVE.B (A2)+, D2 ; copy the next mask byte
- MOVE.W e, D1 ; initialize the loop counter
- MOVEQ.L #7, D3 ; initialize the bit counter
-
- @3: ; copy the next byte
- BTST D3, D2 ; test the next bit in the mask
- BEQ @skip ; if zero, continue
- MOVE.B (A0)+, (A1)+ ; else copy the pixel
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
- JMP @4
- @skip:
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
-
- @4:
- ADDA.L srcNewline, A0 ; bring the src pointer to the start of the next row
- ADDA.L dstNewline, A1 ; bring the dst pointer to the start of the next row
-
- DBF D0, @1
-
- JMP @end ; skip to the end
-
- @table:
- DC.W @sub0
- DC.W @sub1
- DC.W @sub2
- DC.W @sub3
- DC.W @sub4
- DC.W @sub5
- DC.W @sub6
- DC.W @sub7
- DC.W @sub8
- DC.W @sub9
- DC.W @sub10
- DC.W @sub11
- DC.W @sub12
- DC.W @sub13
- DC.W @sub14
- DC.W @sub15
-
- @sub0: ; mask = 0000, draw nothing
- ADDQ.L #4, A0
- ADDQ.L #4, A1
- JMP (A3) ; RTS
-
- @sub1: ; mask = 0001
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub2: ; mask = 0010
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- JMP (A3) ; RTS
-
- @sub3: ; mask = 0011
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub4: ; mask = 0100
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.B (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- JMP (A3) ; RTS
-
- @sub5: ; mask = 0101
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub6: ; mask = 0110
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- JMP (A3) ; RTS
-
- @sub7: ; mask = 0111
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.B (A0)+, (A1)+
- MOVE.W (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub8: ; mask = 1000
- MOVE.B (A0), (A1)
- ADDQ.L #4, A0
- ADDQ.L #4, A1
- JMP (A3) ; RTS
-
- @sub9: ; mask = 1001
- MOVE.B (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub10: ; mask = 1010
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- JMP (A3) ; RTS
-
- @sub11: ; mask = 1011
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub12: ; mask = 1100
- MOVE.W (A0), (A1)
- ADDQ.L #4, A0
- ADDQ.L #4, A1
- JMP (A3) ; RTS
-
- @sub13: ; mask = 1101
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub14: ; mask = 1110
- MOVE.W (A0)+, (A1)+
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- JMP (A3) ; RTS
-
- @sub15: ; mask = 1111
- MOVE.L (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @end:
-
- }
-
- // Switch back to the previous addressing mode
-
- SwapMMUMode( &mode32 );
-
- }
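-
- (For reference, a hypothetical calling sketch -- not from the original post.
- It assumes srcWorld and dstWorld are two locked 8-bit GWorlds and that
- maskBits points to a 32x32 mask packed 4 bytes per row:)
-
-     Point srcPt, dstPt;
-
-     SetPt( &srcPt, 0, 0 );
-     SetPt( &dstPt, 100, 50 );
-     Quick8CopyMask( GetGWorldPixMap( srcWorld ), GetGWorldPixMap( dstWorld ),
-                     maskBits, srcPt, dstPt, 32, 32 );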
-
-
-
-
- And this is the extremely long version, truncated for this posting:
-
-
- void Quick8CopyMask(
- PixMapHandle srcMap,
- PixMapHandle dstMap,
- Ptr mask,
- Point srcPt,
- Point dstPt,
- short width,
- short height )
- {
-
- register char *src;
- register char *dst;
- register long srcNewline;
- register long dstNewline;
- char mode32 = QD32COMPATIBLE;
- short w = (width >> 3) - 1;
- short e = (width & 0x07) - 1;
- short h = height - 1;
-
- // Set up pointers to the beginning of the memory to copy
- // and calculate the newline value for the source and destination
-
- src = GetPixBaseAddr( srcMap ) + (long) ((*srcMap)->rowBytes & 0x3fff) * srcPt.v + srcPt.h;
- srcNewline = ((*srcMap)->rowBytes & 0x3fff) - width;
-
- dst = GetPixBaseAddr( dstMap ) + (long) ((*dstMap)->rowBytes & 0x3fff) * dstPt.v + dstPt.h;
- dstNewline = ((*dstMap)->rowBytes & 0x3fff) - width;
-
- // Switch into 32 bit addressing mode
-
- SwapMMUMode( &mode32 );
-
- // Copy the rect from the source to the destination
-
- asm {
-
- MOVE.W h, D0 ; put height loop variable in D0
- MOVEA.L src, A0 ; put the source pixmap address in A0
- MOVEA.L dst, A1 ; put the destination address in A1
- MOVEA.L mask, A2 ; put the mask address in A2
- CLR.L D2 ; clear the mask register
-
- @1: ; copy the next row
- MOVE.W w, D1
-
- @2: ; copy the next eight bytes in the row
-
- CLR.W D2 ; clear the mask register
- MOVE.B (A2)+, D2 ; copy the next mask byte
- BEQ @nocopy ; if zero, don't copy anything
-
- CMPI.B #0xFF, D2
- BNE @hardway ; don't copy everything
-
- MOVE.L (A0)+, (A1)+ ; copy all bytes
- MOVE.L (A0)+, (A1)+
-
- DBF D1, @2
- JMP @endloop
-
- @nocopy: ; copy no bytes
- ADDQ.L #8, A0
- ADDQ.L #8, A1
-
- DBF D1, @2
- JMP @endloop
-
- @hardway:
- ADD.W D2, D2 ; double the index
- ADD.W @table(D2.W), D2 ; calculate the address
- JMP @table(D2.W) ; plot eight pixels
-
- @endloop:
-
- TST.W e
- BLT @4 ; skip the leftover bytes if e is less than 0
-
- MOVE.B (A2)+, D2 ; copy the next mask byte
- MOVE.W e, D1 ; initialize the loop counter
- MOVEQ.L #7, D3 ; initialize the bit counter
-
- @3: ; copy the next byte
- BTST D3, D2 ; test the next bit in the mask
- BEQ @skip ; if zero, continue
- MOVE.B (A0)+, (A1)+ ; else copy the pixel
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
- JMP @4
- @skip:
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
-
- @4:
- ADDA.L srcNewline, A0 ; bring the src pointer to the start of the next row
- ADDA.L dstNewline, A1 ; bring the dst pointer to the start of the next row
-
- DBF D0, @1
-
- JMP @end ; skip to the end
-
- @table:
- DC.W @sub0
- DC.W @sub1
- DC.W @sub2
- DC.W @sub3
- . .
- . .
- . .
- DC.W @sub253
- DC.W @sub254
- DC.W @sub255
-
- @sub0: ; mask = 00000000
- ADDQ.L #8, A0
- ADDQ.L #8, A1
- DBF D1, @2
- JMP @endloop
-
- @sub1: ; mask = 00000001
- ADDQ.L #7, A0
- ADDQ.L #7, A1
- MOVE.B (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- @sub2: ; mask = 00000010
- ADDQ.L #6, A0
- ADDQ.L #6, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- DBF D1, @2
- JMP @endloop
-
- . .
- . .
- . .
-
- @sub182: ; mask = 10110110
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- DBF D1, @2
- JMP @endloop
-
- @sub183: ; mask = 10110111
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- MOVE.W (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- . .
- . .
- . .
-
- @sub253: ; mask = 11111101
- MOVE.L (A0)+, (A1)+
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- @sub254: ; mask = 11111110
- MOVE.L (A0)+, (A1)+
- MOVE.W (A0)+, (A1)+
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- DBF D1, @2
- JMP @endloop
-
- @sub255: ; mask = 11111111
- MOVE.L (A0)+, (A1)+
- MOVE.L (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- @end:
-
- }
-
- // Switch back to the previous addressing mode
-
- SwapMMUMode( &mode32 );
-
- }
-
-
- --
- _____________________________________________________________________________
- Michael A. Kelly Senior Partner
- mkelly@cs.uoregon.edu High Risk Ventures
- _____________________________________________________________________________
-