- Path: sparky!uunet!pageworks.com!world!eff!sol.ctr.columbia.edu!spool.mu.edu!uwm.edu!ogicse!cs.uoregon.edu!mystix.cs.uoregon.edu!mkelly
- From: mkelly@mystix.cs.uoregon.edu (Michael A. Kelly)
- Newsgroups: comp.sys.mac.programmer
- Subject: Re: Help! making an assembly routine faster
- Message-ID: <1992Nov18.010815.6649@cs.uoregon.edu>
- Date: 18 Nov 92 01:08:15 GMT
- Article-I.D.: cs.1992Nov18.010815.6649
- References: <1992Nov14.200831.20477@nntp.hut.fi> <1992Nov16.014850.28678@cs.uoregon.edu> <1992Nov16.190947.9920@nntp.hut.fi>
- Sender: news@cs.uoregon.edu (Netnews Owner)
- Organization: High Risk Ventures
- Lines: 597
-
- In article <1992Nov16.190947.9920@nntp.hut.fi> jmunkki@vipunen.hut.fi (Juri Munkki) writes:
- >In article <1992Nov16.014850.28678@cs.uoregon.edu> mkelly@mystix.cs.uoregon.edu (Michael A. Kelly) writes:
- >> CMPI.B #0xFF, D2
- >> BNE @hardway ; don't copy everything
- >
- >An addq.w #1, and then a beq might prove to be faster than the cmp with
- >an immediate value. You have to adjust the mask back to its old value,
- >if the test fails, but this can be done either with the jump tables
- >(not with the ones you are using now, but the longer ones I will suggest
- >later in this article) or by a subq.w #1
-
- According to the Motorola manual, you're right.  But in practice it slowed
- things down quite a bit, and I can't figure out why.  I replaced the CMPI
- with an ADDQ #1, then at @hardway I did a SUBQ #1.  My test case is a 32x32
- rect with a 32x32 filled circle as the mask; I'd expect the change to hurt
- even more with more complicated masks.  I still don't see why it's slower:
- the CMPI takes 8 clock cycles and the ADDQ and SUBQ each take 4, so it
- really should come out ahead....  Then again, those timings are for the
- 68000 (and the 68020 too, I think), and I'm using a 68040.
-
- >> @hardway:
- >> ANDI.L #0xF0, D2 ; mask off the low four bits
- >> LSR.W #4, D2 ; shift bits 4-7 into bits 0-3
- >
- >The AND is totally wasted. The LSR will do the masking for you.
-
- :) I don't know *what* I was thinking....
-
- At this point, I ran my test again with the above modifications. They
- improved the speed by about 10%. (Changing the CMPI above decreased
- performance by about 30% with these other changes also in place.)
-
- >You can also eliminate the and and lsr, if you use two 256-entry jump
- >tables that simply ignore the high or low 4 bits. The tables will take
- >some memory (2 x 4 x 256 bytes), but they are easy to construct with
- >copy and paste.
-
- You mean two 16-entry jump tables, right?  I didn't implement that, but
- instead made a separate CopyMask function that uses a single 256-entry
- jump table, with one subroutine for each of the 256 possible mask bytes.
- See the code fragment below.
-
- Hey, maybe I could just save 256 expanded masks (and their complements),
- AND one with the source bytes and the other with the destination bytes,
- then OR the two results together to get the output pixels.  Hmm, I wonder
- if that would be even faster....
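-
- (Something like that in C -- just a sketch of the idea, not code from the
- routine.  BuildExpandTable and CopyEightMasked are made-up names, it assumes
- 8-bit pixels with bit 7 of each mask byte being the leftmost pixel, and it
- ignores alignment.  Whether it would actually beat the jump table is exactly
- the question.)
-
-     static unsigned long expand[256][2];  // mask byte -> 8 bytes of 0xFF/0x00
-
-     static void BuildExpandTable( void )
-     {
-         int byte, bit;
-
-         for ( byte = 0; byte < 256; byte++ )
-             for ( bit = 0; bit < 8; bit++ )
-                 if ( byte & (0x80 >> bit) )
-                     ((unsigned char *) expand[byte])[bit] = 0xFF;
-     }
-
-     // Copy eight 8-bit pixels under one mask byte:
-     // dst = (src AND mask) OR (dst AND NOT mask), one long at a time.
-     static void CopyEightMasked( unsigned char *src, unsigned char *dst,
-                                  unsigned char maskByte )
-     {
-         unsigned long *s = (unsigned long *) src;
-         unsigned long *d = (unsigned long *) dst;
-         unsigned long m0 = expand[maskByte][0];
-         unsigned long m1 = expand[maskByte][1];
-
-         d[0] = (s[0] & m0) | (d[0] & ~m0);
-         d[1] = (s[1] & m1) | (d[1] & ~m1);
-     }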
-
- >> ADD.W D2, D2 ; double the index
- >> ADD.W @table(D2.W), D2 ; calculate the address
- >> JSR @table(D2.W) ; plot four pixels
- >
- >The 68020 has addressing modes that do the multiplication of the index.
- >I haven't needed them myself, but I'm fairly certain that you can improve
- >this part with the right addressing mode.
-
- Nope, I don't think so. You're talking about Address Register Indirect with
- Offset and Index, like so: @table( <no address in this case>, D2.W*2 ).
- The problem is that the value of D2 is preserved in that operation, so instead
- of D2 = (D2 * 2) + @table + (D2 * 2), you get D2 = D2 + @table + (D2 * 2).
-
- >Replace the jsr with a LEA An to the return address and a JMP to the
- >subroutine. Then jump back with a JMP (An). This is quite a bit faster
- >than a JSR/RTS combination, although it's not "good style".
-
- Wow, that made a big difference! About a 17% improvement, making the total
- speedup about 25%.
-
- >> CLR.L D2 ; clear the mask register
- >> MOVE.B (A2)+, D2 ; copy the next mask byte
- >> ANDI.B #0xF, D2 ; mask off the high four bits
- >
- >Use BFEXTU, if you must read the mask again. Remember that you can use
- >-1(A2), if you already incremented A2 or you might be able to account
- >for this with the bitfield offset. You can also use constant bitfield
- >offsets, if I remember correctly. I think you have some registers that
- >you could use, so you could store fairly constant bitfield indices
- >there.
-
- I'm not sure what you mean by constant offsets. I did this:
- BFEXTU -1(A2){4:4}, D2
- and it slowed it down by about 4%.
-
- >> @sub8: ; mask = 1000
- >> MOVE.B (A0)+, (A1)+
- >> ADDQ.L #3, A0
- >> ADDQ.L #3, A1
- >> RTS
- >
- >A move.b (a0),(a1) along with addq #4 is faster on a 68000, but I
- >don't think it matters on new processors. I may be wrong, but you'll
- >probably never see the difference.
-
- You're right, it didn't make any difference at all on my '040.
-
- >In the deep mask version, you could unroll the loop. It's kind of
- >surprising that the 1 bit mask is actually faster, but it's mostly
- >because of the superior algorithm that allows you to directly copy
- >8 bytes at a time in the most common case.
-
- I tossed that code. I don't really think unrolling the loop will get it
- down to my current speed, which is more than twice as fast.
-
- >it might be as much as 10%. The only way to go beyond this is to make
- >the move.l commands aligned on long word destinations, as I mentioned
- >in my previous article.
-
- But as long as I align the source and destination Pixmaps, that isn't an
- issue, right?
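-
- (For what it's worth, here's a hypothetical little check -- not part of the
- routine -- of what "aligned" has to mean for the MOVE.L fast path: the first
- row's source and destination addresses and both rowBytes values all have to
- be multiples of four.  Boolean is the usual <Types.h> type.)
-
-     // Hypothetical helper, not in the posted code: true if every MOVE.L in
-     // the 0xFF fast path hits a long-word-aligned source and destination.
-     static Boolean RowsLongAligned( char *srcRow, char *dstRow,
-                                     long srcRowBytes, long dstRowBytes )
-     {
-         return ( ( (long) srcRow | (long) dstRow |
-                    srcRowBytes | dstRowBytes ) & 3 ) == 0;
-     }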
-
- >I hope my articles offer proof for the other half of my .signature... :-)
-
- Definitely :)
-
-
- OK, here's the new code.  The first one is the newer, better version of
- Quick8CopyMask, with most of the optimizations suggested by Juri.  It's
- about 5.5 times as fast as QuickDraw's CopyMask, at least with my simple
- circle-mask test case.  The second one is a small part of a very large
- Quick8CopyMask that has one subroutine for each of the 256 possible mask
- bytes, rather than only 16 subroutines keyed on mask nibbles (a nibble is
- half a byte, right?).  It's far too long to post here, but if you want a
- copy I'll be happy to email it to you.  It's about 6.5 times as fast as
- CopyMask, or about 15% faster than the short version.
-
- I tested the routines with the mask used in the CalcCMask DTS snippet;
- the short version was 5.7 times as fast as CopyMask and the long version
- was 7 times as fast.
-
- And once again, if anyone can improve on these routines, please tell me how!
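-
- (Before the asm, here's a rough, slow C rendering of what the short version
- does, just as a readable reference -- it is not the posted routine.  It skips
- the 32-bit mode switch and the nibble jump table, and it assumes the mask is
- packed 1 bit per pixel with no per-row padding, bit 7 = leftmost pixel.  The
- name CopyMaskSketch is made up.)
-
-     static void CopyMaskSketch( char *src, char *dst, unsigned char *mask,
-                                 long srcNewline, long dstNewline,
-                                 short width, short height )
-     {
-         short row, group, bit, extra;
-         unsigned char m;
-
-         for ( row = 0; row < height; row++ ) {
-
-             for ( group = 0; group < (width >> 3); group++ ) {
-                 m = *mask++;
-                 if ( m == 0x00 ) {                  // fast path: copy nothing
-                     src += 8;  dst += 8;
-                 } else if ( m == 0xFF ) {           // fast path: copy all eight
-                     *(long *) dst = *(long *) src;  src += 4;  dst += 4;
-                     *(long *) dst = *(long *) src;  src += 4;  dst += 4;
-                 } else {                            // the asm dispatches through
-                     for ( bit = 7; bit >= 0; bit-- ) {   // the jump table here
-                         if ( m & (1 << bit) )
-                             *dst = *src;
-                         src++;  dst++;
-                     }
-                 }
-             }
-
-             extra = width & 0x07;                   // leftover pixels in the row
-             if ( extra ) {
-                 m = *mask++;
-                 for ( bit = 7; bit > 7 - extra; bit-- ) {
-                     if ( m & (1 << bit) )
-                         *dst = *src;
-                     src++;  dst++;
-                 }
-             }
-
-             src += srcNewline;
-             dst += dstNewline;
-         }
-     }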
-
-
- void Quick8CopyMask(
- PixMapHandle srcMap,
- PixMapHandle dstMap,
- Ptr mask,
- Point srcPt,
- Point dstPt,
- short width,
- short height )
- {
-
- register char *src;
- register char *dst;
- register long srcNewline;
- register long dstNewline;
- char mode32 = QD32COMPATIBLE;
- short w = (width >> 3) - 1;
- short e = (width & 0x07) - 1;
- short h = height - 1;
-
- // Set up pointers to the beginning of the memory to copy
- // and calculate the newline value for the source and destination
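- // (the "& 0x3fff" below strips the flag bits QuickDraw keeps in the top of rowBytes)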
-
- src = GetPixBaseAddr( srcMap ) + (long) ((*srcMap)->rowBytes & 0x3fff) * srcPt.v + srcPt.h;
- srcNewline = ((*srcMap)->rowBytes & 0x3fff) - width;
-
- dst = GetPixBaseAddr( dstMap ) + (long) ((*dstMap)->rowBytes & 0x3fff) * dstPt.v + dstPt.h;
- dstNewline = ((*dstMap)->rowBytes & 0x3fff) - width;
-
- // Switch into 32 bit addressing mode
-
- SwapMMUMode( &mode32 );
-
- // Copy the rect from the source to the destination
-
- asm {
-
- MOVE.W h, D0 ; put height loop variable in D0
- MOVEA.L src, A0 ; put the source pixmap address in A0
- MOVEA.L dst, A1 ; put the destination address in A1
- MOVEA.L mask, A2 ; put the mask address in A2
- CLR.L D2 ; clear the mask register
-
- @1: ; copy the next row
- MOVE.W w, D1
-
- @2: ; copy the next eight bytes in the row
-
- MOVE.B (A2)+, D2 ; copy the next mask byte
- BEQ @nocopy ; if zero, don't copy anything
-
- CMPI.B #0xFF, D2
- BNE @hardway ; don't copy everything
-
- MOVE.L (A0)+, (A1)+ ; copy all bytes
- MOVE.L (A0)+, (A1)+
-
- DBF D1, @2
- JMP @endloop
-
- @nocopy: ; copy no bytes
- ADDQ.L #8, A0
- ADDQ.L #8, A1
-
- DBF D1, @2
- JMP @endloop
-
- @hardway:
- LSR.W #4, D2 ; shift bits 4-7 into bits 0-3
- ADD.W D2, D2 ; double the index
- ADD.W @table(D2.W), D2 ; calculate the address
- LEA @rts1, A3 ; save the return address
- JMP @table(D2.W) ; plot four pixels
- @rts1:
-
- MOVE.B -1(A2), D2 ; copy the next mask byte
- ANDI.B #0xF, D2 ; mask off the high four bits
- ADD.W D2, D2 ; double the index
- ADD.W @table(D2.W), D2 ; calculate the address
- LEA @rts2, A3 ; save the return address
- JMP @table(D2.W) ; plot four pixels
- @rts2:
-
- DBF D1, @2
-
- @endloop:
-
- TST.W e
- BLT @4 ; skip the leftover bytes if e is less than 0
-
- MOVE.B (A2)+, D2 ; copy the next mask byte
- MOVE.W e, D1 ; initialize the loop counter
- MOVEQ.L #7, D3 ; initialize the bit counter
-
- @3: ; copy the next byte
- BTST D3, D2 ; test the next bit in the mask
- BEQ @skip ; if zero, continue
- MOVE.B (A0)+, (A1)+ ; else copy the pixel
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
- JMP @4
- @skip:
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
-
- @4:
- ADDA.L srcNewline, A0 ; bring the src pointer to the start of the next row
- ADDA.L dstNewline, A1 ; bring the dst pointer to the start of the next row
-
- DBF D0, @1
-
- JMP @end ; skip to the end
-
- @table:
- DC.W @sub0
- DC.W @sub1
- DC.W @sub2
- DC.W @sub3
- DC.W @sub4
- DC.W @sub5
- DC.W @sub6
- DC.W @sub7
- DC.W @sub8
- DC.W @sub9
- DC.W @sub10
- DC.W @sub11
- DC.W @sub12
- DC.W @sub13
- DC.W @sub14
- DC.W @sub15
-
- @sub0: ; mask = 0000, draw nothing
- ADDQ.L #4, A0
- ADDQ.L #4, A1
- JMP (A3) ; RTS
-
- @sub1: ; mask = 0001
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub2: ; mask = 0010
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- JMP (A3) ; RTS
-
- @sub3: ; mask = 0011
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub4: ; mask = 0100
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.B (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- JMP (A3) ; RTS
-
- @sub5: ; mask = 0101
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub6: ; mask = 0110
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- JMP (A3) ; RTS
-
- @sub7: ; mask = 0111
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- MOVE.B (A0)+, (A1)+
- MOVE.W (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub8: ; mask = 1000
- MOVE.B (A0), (A1)
- ADDQ.L #4, A0
- ADDQ.L #4, A1
- JMP (A3) ; RTS
-
- @sub9: ; mask = 1001
- MOVE.B (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub10: ; mask = 1010
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- JMP (A3) ; RTS
-
- @sub11: ; mask = 1011
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub12: ; mask = 1100
- MOVE.W (A0), (A1)
- ADDQ.L #4, A0
- ADDQ.L #4, A1
- JMP (A3) ; RTS
-
- @sub13: ; mask = 1101
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @sub14: ; mask = 1110
- MOVE.W (A0)+, (A1)+
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- JMP (A3) ; RTS
-
- @sub15: ; mask = 1111
- MOVE.L (A0)+, (A1)+
- JMP (A3) ; RTS
-
- @end:
-
- }
-
- // Switch back to the previous addressing mode
-
- SwapMMUMode( &mode32 );
-
- }
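-
- (For reference, a hypothetical calling sketch -- not from the original post.
- It assumes srcWorld and dstWorld are two locked 8-bit GWorlds and that
- maskBits points to a 32x32 mask packed 4 bytes per row:)
-
-     Point srcPt, dstPt;
-
-     SetPt( &srcPt, 0, 0 );
-     SetPt( &dstPt, 100, 50 );
-     Quick8CopyMask( GetGWorldPixMap( srcWorld ), GetGWorldPixMap( dstWorld ),
-                     maskBits, srcPt, dstPt, 32, 32 );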
-
-
-
-
- And this is the extremely long version, truncated for this posting:
-
-
- void Quick8CopyMask(
- PixMapHandle srcMap,
- PixMapHandle dstMap,
- Ptr mask,
- Point srcPt,
- Point dstPt,
- short width,
- short height )
- {
-
- register char *src;
- register char *dst;
- register long srcNewline;
- register long dstNewline;
- char mode32 = QD32COMPATIBLE;
- short w = (width >> 3) - 1;
- short e = (width & 0x07) - 1;
- short h = height - 1;
-
- // Set up pointers to the beginning of the memory to copy
- // and calculate the newline value for the source and destination
-
- src = GetPixBaseAddr( srcMap ) + (long) ((*srcMap)->rowBytes & 0x3fff) * srcPt.v + srcPt.h;
- srcNewline = ((*srcMap)->rowBytes & 0x3fff) - width;
-
- dst = GetPixBaseAddr( dstMap ) + (long) ((*dstMap)->rowBytes & 0x3fff) * dstPt.v + dstPt.h;
- dstNewline = ((*dstMap)->rowBytes & 0x3fff) - width;
-
- // Switch into 32 bit addressing mode
-
- SwapMMUMode( &mode32 );
-
- // Copy the rect from the source to the destination
-
- asm {
-
- MOVE.W h, D0 ; put height loop variable in D0
- MOVEA.L src, A0 ; put the source pixmap address in A0
- MOVEA.L dst, A1 ; put the destination address in A1
- MOVEA.L mask, A2 ; put the mask address in A2
- CLR.L D2 ; clear the mask register
-
- @1: ; copy the next row
- MOVE.W w, D1
-
- @2: ; copy the next eight bytes in the row
-
- CLR.W D2 ; clear the mask register
- MOVE.B (A2)+, D2 ; copy the next mask byte
- BEQ @nocopy ; if zero, don't copy anything
-
- CMPI.B #0xFF, D2
- BNE @hardway ; don't copy everything
-
- MOVE.L (A0)+, (A1)+ ; copy all bytes
- MOVE.L (A0)+, (A1)+
-
- DBF D1, @2
- JMP @endloop
-
- @nocopy: ; copy no bytes
- ADDQ.L #8, A0
- ADDQ.L #8, A1
-
- DBF D1, @2
- JMP @endloop
-
- @hardway:
- ADD.W D2, D2 ; double the index
- ADD.W @table(D2.W), D2 ; calculate the address
- JMP @table(D2.W) ; plot eight pixels
-
- @endloop:
-
- TST.W e
- BLT @4 ; skip the leftover bytes if e is less than 0
-
- MOVE.B (A2)+, D2 ; copy the next mask byte
- MOVE.W e, D1 ; initialize the loop counter
- MOVEQ.L #7, D3 ; initialize the bit counter
-
- @3: ; copy the next byte
- BTST D3, D2 ; test the next bit in the mask
- BEQ @skip ; if zero, continue
- MOVE.B (A0)+, (A1)+ ; else copy the pixel
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
- JMP @4
- @skip:
- ADDQ.L #1, A0
- ADDQ.L #1, A1
- SUBQ.L #1, D3 ; decrement the bit counter
- DBF D1, @3
-
- @4:
- ADDA.L srcNewline, A0 ; bring the src pointer to the start of the next row
- ADDA.L dstNewline, A1 ; bring the dst pointer to the start of the next row
-
- DBF D0, @1
-
- JMP @end ; skip to the end
-
- @table:
- DC.W @sub0
- DC.W @sub1
- DC.W @sub2
- DC.W @sub3
- . .
- . .
- . .
- DC.W @sub253
- DC.W @sub254
- DC.W @sub255
-
- @sub0: ; mask = 00000000
- ADDQ.L #8, A0
- ADDQ.L #8, A1
- DBF D1, @2
- JMP @endloop
-
- @sub1: ; mask = 00000001
- ADDQ.L #7, A0
- ADDQ.L #7, A1
- MOVE.B (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- @sub2: ; mask = 00000010
- ADDQ.L #6, A0
- ADDQ.L #6, A1
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- DBF D1, @2
- JMP @endloop
-
- . .
- . .
- . .
-
- @sub182: ; mask = 10110110
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- DBF D1, @2
- JMP @endloop
-
- @sub183: ; mask = 10110111
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- MOVE.W (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- . .
- . .
- . .
-
- @sub253: ; mask = 11111101
- MOVE.L (A0)+, (A1)+
- MOVE.W (A0), (A1)
- ADDQ.L #3, A0
- ADDQ.L #3, A1
- MOVE.B (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- @sub254: ; mask = 11111110
- MOVE.L (A0)+, (A1)+
- MOVE.W (A0)+, (A1)+
- MOVE.B (A0), (A1)
- ADDQ.L #2, A0
- ADDQ.L #2, A1
- DBF D1, @2
- JMP @endloop
-
- @sub255: ; mask = 11111111
- MOVE.L (A0)+, (A1)+
- MOVE.L (A0)+, (A1)+
- DBF D1, @2
- JMP @endloop
-
- @end:
-
- }
-
- // Switch back to the previous addressing mode
-
- SwapMMUMode( &mode32 );
-
- }
-
-
- --
- _____________________________________________________________________________
- Michael A. Kelly Senior Partner
- mkelly@cs.uoregon.edu High Risk Ventures
- _____________________________________________________________________________
-