NetNews Usenet Archive 1992 #31

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #31 / NN_1992_31.iso / spool / comp / sys / intel / 2848 < prev next >

Wrap

Internet Message Format | 1992-12-30 | 3.4 KB

Path: sparky!uunet!gatech!darwin.sura.net!zaphod.mps.ohio-state.edu!saimiri.primate.wisc.edu!ames!olivea!charnel!rat!zeus! From: mneideng@thidwick.acs.calpoly.edu (Mark Neidengard) Newsgroups: comp.sys.intel Subject: Hyperscalar CPU's Message-ID: <1992Dec30.182049.136180@zeus.calpoly.edu> Date: 30 Dec 92 18:20:49 GMT References: <1992Dec24.182252.15072@u.washington.edu> Sender: news@zeus.calpoly.edu Organization: Academic Computing Services, Cal Poly San Luis Obispo Lines: 64 Here's the $.02 for the "raging" CPU debate =) One of my favorite arguments for high-scalability microprocessors is highly-repetitive operations that can be executed in parallel. Imagine the process of finding the determinant of a 3x3 matrix. Assuming two-source arithmetic units, and using "expansion by minors", the problem would look like this: | A B C | | D E F | = A(EI-HF)+B(DI-GF)+C(DH-GE) | G H I | That is to say, we need to do three sets of: two multiplications, a subtraction, and another multiplication; then do two additions. This makes a total of 14 operations, each of which would probably execute in more than one cycle on current microprocessors. IF, however, we implemented three multipliers and three adders, the flow would look more like this: Cycle Instruction ----- ------------ 0 mul1(E,I,add1a) mul2(D,I,add2a) mul3(D,H,add3a) 1 mul1(H,F,add1b) mul2(G,F,add2b) mul3(G,E,add3b) 2 add1(a,b,mul1a) add2(a,b,mul2a) add3(a,b,mul3a) 3 mul1(a,A,add1a) mul2(a,B,add1b) mul3(a,C,add2a) 4 add1(a,b,add2b) 5 add2(a,b,out) Thus, assuming one-cycle execution units, we would have a 6 cycle determinant, much faster than the 14 cycles it would take on a non-superscalar processor. The process could be sped up even further if we throw in a "three-source adder": 3 mul1(a,A,add4a) mul2(a,B,add4b) mul3(a,C,add4c) 4 add4(a,b,c,out) If we postulate a "three-source multiplier", life looks even better: 0 mul1(A,E,I,add1a) mul2(B,D,I,add1b) mul3(C,D,H,add1c) 1 mul1(A,H,-F,add1a) mul2(B,G,-F,add1b) mul3(C,G,-E,add1c) add1(a,b,c,add2a) 2 add1(a,b,c,add2b) 3 add2(a,b,out) Thus, we have taken a 14 operation process and turned it into 4. We could knock off another cycle if we turned our "three-source adder" into a "four-source accumulator", with three sources added to the current value. This would obviate the need for cycle 3. Such an architecture would have a latency for our determinant of only 3 cycles, and, once set in motion, could crank out a determinant every other cycle. Now, this architecture may be somewhat far-fetched, and perhaps it seems like a VERY specific task to find oodles of determinants, but similarly "focused" tasks have already made their debut in DSP's. In terms of general CPU power, even if you had tasks that only needed one arithmetic unit at once, you could run several of them in parallel. This would facilitate "multithreading" by making it physically possible to have one or more no-break threads executing simultaneously. As for the exotic ALU's, I believe many people could come up with inventive ways to use them... Mark Neidengard "Ilegitimi non carborundum" mneideng@cosmos.acs.calpoly.edu "Flocci non facio" "I think I'm losing my mind this time "Et tu, Brute?" This time, I'm losing my MIND!" Beastie Boys