home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!gatech!darwin.sura.net!zaphod.mps.ohio-state.edu!saimiri.primate.wisc.edu!ames!olivea!charnel!rat!zeus!
- From: mneideng@thidwick.acs.calpoly.edu (Mark Neidengard)
- Newsgroups: comp.sys.intel
- Subject: Hyperscalar CPU's
- Message-ID: <1992Dec30.182049.136180@zeus.calpoly.edu>
- Date: 30 Dec 92 18:20:49 GMT
- References: <1992Dec24.182252.15072@u.washington.edu>
- Sender: news@zeus.calpoly.edu
- Organization: Academic Computing Services, Cal Poly San Luis Obispo
- Lines: 64
-
- Here's the $.02 for the "raging" CPU debate =)
-
- One of my favorite arguments for high-scalability microprocessors is
- highly-repetitive operations that can be executed in parallel. Imagine the
- process of finding the determinant of a 3x3 matrix. Assuming two-source
- arithmetic units, and using "expansion by minors", the problem would look
- like this:
-
- | A B C |
- | D E F | = A(EI-HF)+B(DI-GF)+C(DH-GE)
- | G H I |
-
- That is to say, we need to do three sets of: two multiplications, a
- subtraction, and another multiplication; then do two additions. This makes
- a total of 14 operations, each of which would probably execute in more than
- one cycle on current microprocessors. IF, however, we implemented three
- multipliers and three adders, the flow would look more like this:
-
- Cycle Instruction
- ----- ------------
- 0 mul1(E,I,add1a) mul2(D,I,add2a) mul3(D,H,add3a)
- 1 mul1(H,F,add1b) mul2(G,F,add2b) mul3(G,E,add3b)
- 2 add1(a,b,mul1a) add2(a,b,mul2a) add3(a,b,mul3a)
- 3 mul1(a,A,add1a) mul2(a,B,add1b) mul3(a,C,add2a)
- 4 add1(a,b,add2b)
- 5 add2(a,b,out)
-
- Thus, assuming one-cycle execution units, we would have a 6 cycle
- determinant, much faster than the 14 cycles it would take on a
- non-superscalar processor. The process could be sped up even further if we
- throw in a "three-source adder":
-
- 3 mul1(a,A,add4a) mul2(a,B,add4b) mul3(a,C,add4c)
- 4 add4(a,b,c,out)
-
- If we postulate a "three-source multiplier", life looks even better:
-
- 0 mul1(A,E,I,add1a) mul2(B,D,I,add1b) mul3(C,D,H,add1c)
- 1 mul1(A,H,-F,add1a) mul2(B,G,-F,add1b) mul3(C,G,-E,add1c)
- add1(a,b,c,add2a)
- 2 add1(a,b,c,add2b)
- 3 add2(a,b,out)
-
- Thus, we have taken a 14 operation process and turned it into 4. We could
- knock off another cycle if we turned our "three-source adder" into a
- "four-source accumulator", with three sources added to the current value.
- This would obviate the need for cycle 3. Such an architecture would have a
- latency for our determinant of only 3 cycles, and, once set in motion, could
- crank out a determinant every other cycle.
-
- Now, this architecture may be somewhat far-fetched, and perhaps it seems
- like a VERY specific task to find oodles of determinants, but similarly
- "focused" tasks have already made their debut in DSP's. In terms of general
- CPU power, even if you had tasks that only needed one arithmetic unit at
- once, you could run several of them in parallel. This would facilitate
- "multithreading" by making it physically possible to have one or more no-break
- threads executing simultaneously. As for the exotic ALU's, I believe many
- people could come up with inventive ways to use them...
-
- Mark Neidengard "Ilegitimi non carborundum"
- mneideng@cosmos.acs.calpoly.edu "Flocci non facio"
- "I think I'm losing my mind this time "Et tu, Brute?"
- This time, I'm losing my MIND!"
- Beastie Boys
-