HPAVC

home *** CD-ROM | disk | FTP | other *** search

/ HPAVC / HPAVC CD-ROM.iso / pc / PASROTO.ZIP / ROTO.ASM < prev next >

Wrap

Assembly Source File | 1996-05-23 | 12.3 KB | 479 lines

comment #/* cache optimized roto-zoomer (c) '95/96 Niklas Beisert / pascal Use this in your non commercial productions if you are too lazy, or use some of the ideas in this "article" if you like, but don't be too lazy to greet me then! ;) Spread it only with all associated files: FILE_ID.DIZ ROTO.ASM ROTO.H ROTO.OBJ MAKEFILE ROTOEXAM.CPP ROTO.EXE TIMER.ASM TIMER.H Send me some money: :) pascal@nightmare.harz.de, Niklas Beisert@2:2437/301.44 Niklas Beisert, Oberstrasse 84, 20149 Hamburg, Germany Thanks go to jmagic/complex for releasing some source of his asm94 intro. The texture innerloop is quite fast and pentium optimized, so I used it! ;) Furthermore thanks go to Daredevil and Tran for pmode/w (lite). usage: texturescreen dst,src,x0,y0,xx,xy,yx,yy,xblk,yblk,scrwid,op,dir dst: destination address. eg:0xA0000 src: segment aligned (0xijkl0000) 256x256 map x0,y0,xx,xy,yx,yy: 16:16 fixpoint transformation matrix xblk,yblk: a block is 8x8, so 320x200 would be 40,25 scrwid: bytes per line op: opcode: 0x89:mov, 0x01:add... dir: blockorder: 0:horizontal first, 1:vertical first tested and compiled with watcom c++ 10.0 and tasm 3.1 transformation: dst[x,y]=src[x0+xx*x+xy*y,y0+yx*x+yy*y] history and background: Second Reality: you remeber the head with the pentagram and the lens effect... then a flash and whooopsy, what's that? this is 160x100... Why? Otherwise it'd be too slow! But why, This is such a simple effect? When I (some time later) recoded this effect I noticed that the framerate drops at a certain angle, and the only reason could be the cache. The processor cache is organized in a special way to have fast access to it's memory. So you have cache lines of 16 (32) bytes on a 486 (pentium) which are atoms. They all have a tag address field which stores the position of the 16 (32) bytes in memory. Then you have 4 (2) ways which are in a way 4 (2) equal caches which can be processed at the same time. Finally there are 256 sets of one cache line per way. Bits 4-11 (5-12) of the address determine the used set for a memory access. At a memory access, the address is split into 3 parts: bits 0-3 (0-4) determine the byte in a line, bits 4-11 (5-12) determine the set, and bits 12-31 (13-31) are the tag address. The tag address is then compared to the tag addresses of the 4 (2) lines of the set. If one matches it is a cache hit, if not you get a cache miss, 16 (32) bytes are read from memory to the least recently used cache line of that set. This takes about 23 cycles on a 486dx2-66, while a cache hit takes no extra cycles. A cache is most effective if you read the memory in a linear order like you do it in a rotozoomer at low angles. You then get one cache miss out of 16 (32) memory accesses. Now imagine the angle is exactly 90°. You would then read then memory in steps of 256, after 8k the first cache line is overwritten, so if you process the next line, it is a cache miss. This results in 100% cache misses... How to optimize it? I had several discussions with Scholar / $eeN on this topic. (hiho!) We though about rendering the screen in a different order, so that the texture is read in a linear fashion. This would be diagonal lines instead of h-lines. But this is not a fast solution either, and more complicated anyway. You could also keep prerotated versions of the texture, but this would require 2x or more the amount of memory, and you are limited to a fixed texture if you do not want to modify 2 textures all the time. The 8x8 block approach was a good compromise. :) You can write dwords, and do not need too much memory while the cache contents are not destroyed. You can also use this 8x8 block approach to optimize movelist-tunnels: keep the movelist linear, while you go though the 8x8 blocks. And you can do other nice things with 8x8 blocks... ;)))))))))) Which I cannot tell you yet. probably later! =} Cache optimizing seems to be quite stupid for vector engines, but it is ESSENTIAL for fast bitmap effects. This little assembler fragments show you what cache can do and what it cannot: fastloop: mov dx,12 l1: mov cx,32768 l2: mov ax,[0] mov ax,[0] mov ax,[0] mov ax,[0] mov ax,[0] dec cx jnz l2 dec dx jnz l1 slowloop: mov dx,12 l3: mov cx,32768 l4: mov ax,[2047] mov ax,[4095] mov ax,[6143] mov ax,[8191] mov ax,[10239] dec cx jnz l4 dec dx jnz l3 On a 486dx2-66 the first loop is about 25-50 times as fast as the second one. If you don't believe me, try it yourself. On a pentium 3 moves are enough to show the effects of the cache: mov ax,[4095] mov ax,[8191] mov ax,[12287] use those in the loop. description: This rotozoomer (also stretcher if you like...) does not process the picture like it is usually done, line by line, but block by block. The reason is simple: If you do it line by line you get 100% cache misses beyond a certain rotation angle (~60° on a 486dx 8k processor cache). 100% cache misses mean 64000 (mode 13h) times 23 cycles on a 486dx2-66 (experimental value), which is > 20ms, and your frame rate cannot get any better than 50Hz (ignoring all other operations). You might end up at 30Hz at certain angels and 100Hz at others. If you do it block by block you can reduce the cache misses to about 4000 to 8000 (2-5 ms), which has only little effect on the framerate. You end up with a rotozoomer which runs at constant 100fps on a dx2-66. or constant 350fps on a p120. Every 8x8 block is processed in the usual way, ie. 8 pixels, next line, 8 pixels, next line... The blocks are then drawn either in horizontal or vertical order, ie. row by row vs column by column. If used correctly this feature reduces the number of cache misses still a bit. You should use vertical order, if the angle is about 90° or 270°. If it is rather 0° or 180° use horizontal order. I had no problem to use 4 rotozoomers at the same time in our intro "LASSE REINB0NG" which was still quite smooth on a 486dx2-66 (>15fps). This is the same routine as used in the intro, so believe me, it's fast. (only that you can now control the blockorder) Nothing was slowed down! On a P120 this routine nearly runs in 1 frame / retrace... (did I say 640x480??? :) ) That is why there is the bytes/line parameter. You can use this routine with segmented screen memory. In 640x480 you should set the virtual screen width to 1024. Then the bytes/line is not equal to yblocks*8. You can then process a 80x8 blocks (640x64) range with this routine without segment changes. If you process 7.5 of these ranges with segment changes in between you can fill the screen with this routine. */# .486 model flat,prolog locals .code adc_ecx macro lab db 081h,0D1h lab dd 0 endm adc_edx macro lab db 081h,0D2h lab dd 0 endm add_edx macro lab db 081h,0C2h lab dd 0 endm add_ecx macro lab db 081h,0C1h lab dd 0 endm sub_ecx macro lab db 081h,0E9h lab dd 0 endm add_edi macro lab db 081h,0C7h lab dd 0 endm textureblock proc add_ecx beforedfy0 mov bh,dl adc dl,0 mov bl,cl @@block: add_edx dfx0dy0_0 mov al,[ebx] adc_ecx dfy0dx0_0 mov bh,dl mov bl,cl adc_edx dfx0dy0_1 mov ah,[ebx] adc_ecx dfy0dx0_1 mov bh,dl bswap eax mov bl,cl adc_edx dfx0dy0_2 mov ah,[ebx] adc_ecx dfy0dx0_2 mov bh,dl mov bl,cl adc_edx dfx0dy0_3 mov al,[ebx] adc_ecx dfy0dx0_3 mov bh,dl bswap eax adc dl,0 op0 db 089h,007h ;//op [edi],eax mov bl,cl add_edx dfx0dy0_4 mov al,[ebx] adc_ecx dfy0dx0_4 mov bh,dl mov bl,cl adc_edx dfx0dy0_5 mov ah,[ebx] adc_ecx dfy0dx0_5 mov bh,dl bswap eax mov bl,cl adc_edx dfx0dy0_6 mov ah,[ebx] adc_ecx dfy7dx0 mov bh,dl mov bl,cl adc_edx dfx7dy7 mov al,[ebx] adc_ecx dfy0dx7 mov bh,dl bswap eax adc dl,0 op1 db 089h,047h,004h ;//op [edi+4],eax mov bl,cl add_edi scrwidth dec esi jnz @@block sub_ecx afterdfy0 sbb dl,0 ret endp public texturescreen_ texturescreen_ proc dst:dword, src:dword, x0:dword, y0:dword, xx:dword, xy:dword, yx:dword, yy:dword, xblocks:dword, yblocks:dword, scrwid:dword, op:dword, dir:dword local c1:dword, c2:dword, retxf1:dword, retyf1:dword, retsp1:dword, retxf2:dword, retyf2:dword, retsp2:dword, retxi1:byte, retyi1:byte, retxi2:byte, retyi2:byte lea edi,textureblock+128 mov eax,scrwid mov [edi-128+scrwidth-textureblock],eax mov al,byte ptr op mov [edi-128+op0-textureblock],al mov [edi-128+op1-textureblock],al mov eax,xx shl eax,16 mov al,byte ptr xy+2 mov [edi-128+dfx0dy0_0-textureblock],eax mov [edi-128+dfx0dy0_1-textureblock],eax mov [edi-128+dfx0dy0_2-textureblock],eax mov [edi-128+dfx0dy0_3-textureblock],eax mov [edi-128+dfx0dy0_4-textureblock],eax mov [edi-128+dfx0dy0_5-textureblock],eax mov [edi-128+dfx0dy0_6-textureblock],eax mov ebx,xy shl ebx,3 sub ebx,xy neg ebx add ebx,yy shr ebx,16 mov al,bl mov ebx,xx shl ebx,19 sub eax,ebx mov ebx,yx shl ebx,16 add eax,ebx mov [edi-128+dfx7dy7-textureblock],eax mov eax,xy shl eax,16 mov [edi-128+beforedfy0-textureblock],eax mov [edi-128+afterdfy0-textureblock],eax mov al,byte ptr xx+2 mov [edi-128+dfy0dx0_0-textureblock],eax mov [edi-128+dfy0dx0_1-textureblock],eax mov [edi-128+dfy0dx0_2-textureblock],eax mov [edi-128+dfy0dx0_3-textureblock],eax mov [edi-128+dfy0dx0_4-textureblock],eax mov [edi-128+dfy0dx0_5-textureblock],eax mov ebx,xx shl ebx,3 sub ebx,xx neg ebx add ebx,yx shr ebx,16 mov al,bl mov [edi-128+dfy0dx7-textureblock],eax mov ebx,xy shl ebx,19 sub eax,ebx mov ebx,yy shl ebx,16 add eax,ebx mov al,byte ptr xx+2 mov [edi-128+dfy7dx0-textureblock],eax shl xx,3 shl xy,3 shl yx,3 shl yy,3 cmp dir,0 jne @@vert @@horz: mov eax,xx sub eax,yx rol eax,16 mov retxi1,al and eax,not 65535 mov retxf1,eax mov eax,xy sub eax,yy rol eax,16 mov retyi1,al and eax,not 65535 mov retyf1,eax mov eax,scrwid neg eax inc eax shl eax,3 mov retsp1,eax mov eax,xblocks neg eax imul xx add eax,yx rol eax,16 mov retxi2,al and eax,not 65535 mov retxf2,eax mov eax,xblocks neg eax imul xy add eax,yy rol eax,16 mov retyi2,al and eax,not 65535 mov retyf2,eax mov eax,xblocks neg eax add eax,scrwid shl eax,3 mov retsp2,eax mov ebx,src mov edi,dst mov edx,x0 rol edx,16 mov ecx,y0 rol ecx,16 xchg cl,dl mov eax,yblocks mov c1,eax @@l1: mov eax,xblocks mov c2,eax @@l2: mov esi,8 call textureblock add edx,retxf1 adc cl,retxi1 add ecx,retyf1 adc dl,retyi1 add edi,retsp1 dec c2 jnz @@l2 add edx,retxf2 adc cl,retxi2 add ecx,retyf2 adc dl,retyi2 add edi,retsp2 dec c1 jnz @@l1 jmp @@done @@vert: mov eax,yblocks neg eax imul eax,yx add eax,xx rol eax,16 mov retxi1,al and eax,not 65535 mov retxf1,eax mov eax,yblocks neg eax imul eax,yy add eax,xy rol eax,16 mov retyi1,al and eax,not 65535 mov retyf1,eax mov eax,yblocks neg eax imul scrwid inc eax shl eax,3 mov retsp1,eax mov ebx,src mov edi,dst mov edx,x0 rol edx,16 mov ecx,y0 rol ecx,16 xchg cl,dl mov eax,xblocks mov c1,eax @@l3: mov esi,yblocks shl esi,3 call textureblock add edx,retxf1 adc cl,retxi1 add ecx,retyf1 adc dl,retyi1 add edi,retsp1 dec c1 jnz @@l3 @@done: ret endp end