Xref: sparky comp.sys.intel:2798 comp.arch:11821
Newsgroups: comp.sys.intel,comp.arch
Path: sparky!uunet!think.com!enterpoop.mit.edu!bloom-picayune.mit.edu!athena.mit.edu!solman
From: solman@athena.mit.edu (Jason W Solinsky)
Subject: Re: Superscalar vs. multiple CPUs ?
Message-ID: <1992Dec21.133318.2975@athena.mit.edu>
Sender: news@athena.mit.edu (News system)
Nntp-Posting-Host: m4-035-4.mit.edu
Organization: Massachusetts Institute of Technology
References: <WAYNE.92Dec4093422@backbone.uucp> <37595@cbmvax.commodore.com> <PCG.92Dec13170504@aberdb.aber.ac.uk>
Date: Mon, 21 Dec 1992 13:33:18 GMT
Lines: 83

In article <PCG.92Dec13170504@aberdb.aber.ac.uk>, pcg@aber.ac.uk (Piercarlo Grandi) writes:
|> On 10 Dec 92 00:29:51 GMT, solman@athena.mit.edu (Jason W Solinsky) said:
|> Nntp-Posting-Host: m4-035-15.mit.edu
|>
|> solman> (Piercarlo Grandi) writes:
|> |> (Bernard Gunther) said:
|>
|> No, actually I (pcg) said this:
|>
|> pcg> Well, certain tricks can also be used with multiple CPUs on the
|> pcg> same die. And these have an important advantage: as far as I can
|> pcg> see, 6 instruction issue per cycle is virtually pointless. The
|> pcg> *limit* of superscalarity present in general purpose codes is 4,
|> pcg> and actually we are hard pressed to find many codes with
|> pcg> superscalarity higher than 2.
|>
|> solman> Err, I believe those are single threaded codes. If you're only
|> solman> dealing with one thread at a time, then you might as well not
|> solman> bother putting multiple CPUs on a die either.
|>
|> Precisely my point: single threaded *general purpose* codes have a
|> limited intrinsic degree of exploitable parallelism.

And general purpose computing does not, in general, involve single-threaded
codes. Codes can be parallelized further at every level of abstraction, from
multi-tasking down to instruction-level parallelism (ILP). If you define the
codes which we are concerned with to be codes which can only exploit ILP, then
of course the level of parallelism is limited, but you are no longer dealing
with general purpose computing.

|> solman> It also sounds to me like you are looking at a very narrow
|> solman> subset of programs. If this were true then modern heavily
|> solman> pipelined uPs would be horribly inefficient.
|>
|> Indeed pipeline designs with more than a few stages of pipelining run
|> into huge problems, and are worth doing only if a significant proportion
|> of SIMD-like operation is expected. Pipeline bubbles start to become a
|> significant problem beyond 4 pipeline stages on general purpose codes,
|> even on non superscalar architectures.

This can be taken care of by interleaving different threads in the software,
or by using hardware which takes care of the interleaving on its own. The
above statement is only true when the compiler is too dumb to notice
higher-level parallelism.
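A back-of-the-envelope model shows why interleaving hides the bubbles. The
sketch below is a toy (all numbers illustrative, not measurements of any real
chip): assume every instruction depends on the previous one in its own thread
and a result takes 4 cycles to come back, so a lone thread stalls 3 cycles
per instruction. A barrel-style machine that issues one instruction from each
of 4 threads in turn keeps every issue slot full:

```python
def cycles_single_thread(n_instr, latency):
    # Each instruction must wait `latency` cycles for the result of
    # the previous one, so the pipeline bubbles dominate.
    return n_instr * latency

def cycles_interleaved(n_instr, latency, n_threads):
    # Barrel-processor style: issue one instruction per thread in
    # round-robin order.  A given thread issues again every
    # max(latency, n_threads) cycles, so with n_threads >= latency
    # the dependency latency is completely hidden.
    gap = max(latency, n_threads)
    return n_instr * gap      # roughly: cycles to run n_instr per thread

# 1000 dependent instructions per thread, 4-cycle result latency:
single = cycles_single_thread(1000, 4)        # 1000 instructions executed
merged = cycles_interleaved(1000, 4, 4)       # 4 * 1000 instructions executed
print(1000 / single, 4 * 1000 / merged)       # throughput in IPC
```

Under these toy assumptions the single thread sustains 0.25 IPC while four
interleaved threads sustain 1.0 IPC through the same pipeline, which is the
point being argued: the "limit" is a property of one thread, not of the codes.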

|> solman> A lot of hardware is freed up by breaking the "imaginary" lines.
|>
|> Uhmmm, what I read is that almost all the space of modern 1-3 million
|> CPU chips is taken up by caches and register files.

Both of which are excellent examples of how breaking the imaginary lines can
improve performance. The key question in choosing how large register files
and caches should be is: "How large a {register file or cache} do I need
for `good' performance on the algorithms I want to run?" Invariably, the
size chosen is too small some of the time, while much of it is left unused
at other times. In the multiple-CPU version, this still happens. In the
hyperscalar version, however, some of the execution units and threads will
need a larger {cache or reg file} and some will be unable to utilize the
existing space, but because they can share the same caches and register files,
it is far less likely for performance to be limited by cache or register file
size. In real life, instead of resulting in a greater level of performance,
this would likely result in smaller caches and register files that produce
the same performance level. That is where the additional space comes from.
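This is a statistical-multiplexing argument, and a toy simulation makes it
concrete. Everything below is assumed for illustration (the working-set
distribution, the sizes, the overflow criterion are all made up): four
threads with bursty demands either get a private slice of cache each, or
share one pool of the same total size.

```python
import random

random.seed(42)

THREADS   = 4
PER_SLICE = 32                   # KB per thread if statically partitioned
SHARED    = THREADS * PER_SLICE  # same total silicon as one shared pool
TRIALS    = 10_000

partitioned_misses = shared_misses = 0
for _ in range(TRIALS):
    # hypothetical working-set sizes: bursty, mean 24 KB per thread
    demands = [random.expovariate(1 / 24) for _ in range(THREADS)]
    # partitioned: any one thread overflowing its own slice suffers,
    # even while its neighbours' slices sit half empty
    if any(d > PER_SLICE for d in demands):
        partitioned_misses += 1
    # shared: only the *sum* of the demands matters
    if sum(demands) > SHARED:
        shared_misses += 1

print(f"overflow rate: partitioned {partitioned_misses}/{TRIALS}, "
      f"shared {shared_misses}/{TRIALS}")
```

Because the threads' peaks rarely coincide, the shared pool overflows far
less often than the partitioned one for the same total capacity, which is
exactly why the shared version could be built smaller for equal performance.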

|> solman> This can then be used to re-implement the lost abilities, like
|> solman> context switching, but in a manner that allows them to be used
|> solman> by all the parts of the chip.
|>
|> Again, my impression that designing a multithreaded CPU looks already a
|> daunting task implies that having the CPU threads execute in different
|> CPU and MMU contexts (thus having many virtual interrupt vectors,
|> register files, traps, and instruction restart/resume on faults) looks
|> harder still. Maybe somebody has already done research along these
|> lines; maybe it is feasible, and maybe it is even cost effective. I can
|> only remember old, very limited, and not too successful examples.

It's an open question how you want to do it. I personally favor tagging
everything, and avoiding the concept of contexts in a dataflowish fashion.
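The idea, very roughly sketched below (class and field names are mine, purely
illustrative, not any real machine's design): if every architectural entry
carries its thread's tag as part of its name, then "context" is just a key,
and nothing ever needs to be saved or restored on a switch.

```python
class TaggedRegisterFile:
    """Toy model: one physical structure serves every thread.

    Entries are keyed by (thread_tag, register), so two threads' r3
    coexist side by side and no save/restore sequence ever runs."""

    def __init__(self):
        self.entries = {}

    def write(self, tag, reg, value):
        self.entries[(tag, reg)] = value

    def read(self, tag, reg):
        return self.entries[(tag, reg)]

rf = TaggedRegisterFile()
rf.write(tag=0, reg=3, value=42)   # thread 0's r3
rf.write(tag=1, reg=3, value=99)   # thread 1's r3, no context switch
print(rf.read(0, 3), rf.read(1, 3))
```

The hardware analogue would be a CAM or extra index bits rather than a
dictionary, but the property argued for is the same: interleaving threads
costs a tag comparison, not a context switch.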

As far as whether it's cost effective and feasible: if you see it two years
from now, it's a good idea now. Otherwise, it's not. :-)

Jason W. Solinsky