NetNews Usenet Archive 1992 #31

Text File | 1992-12-30 | 11.2 KB | 217 lines

Newsgroups: comp.parallel
Path: sparky!uunet!zaphod.mps.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!gatech!hubcap!fpst
From: Steven Ericsson Zenith <zenith@kai.com>
Subject: Re: Linda / The Parform
Message-ID: <1992Dec30.212742.24207@hubcap.clemson.edu>
Sender: fpst@hubcap.clemson.edu (Steve Stevenson)
Organization: Clemson University
Date: Wed, 30 Dec 92 09:25:27 -0600
Approved: parallel@hubcap.clemson.edu
Lines: 205

Well, regular readers of comp.parallel over the years will know that this isn't the first time my idiosyncratic use of the English language has gotten me into trouble. Colorful, and perhaps inappropriate at times, it is a reflection of me, my background and culture (or lack of it) and I do not apologize for it.

I never -ever- have malicious intent in my remarks - I am sorry if anything I say should be interpreted so. I do, however, comment on areas in which I have some knowledge or have experience undertaking similar experiments. I may criticize the work of others in my field but such criticism should not be taken as an attack on the person - though I understand the temptation to initially interpret it so.

I hope I am as gracious when receiving criticism of my own work. It too has strengths and weaknesses and that work and postings here over the years have revealed both - I have no shame of either.

I have read the "old" report to which Dr. Cap refers and, as he points out, it does not answer the questions raised in my posting. I look forward to reading his coming publication. I will also note that the ability to fund a project has never in my experience been an accurate reflection of its worth, most especially in Europe.

There are just a couple of points that need clarifying.

When I referred to use of the same base compiler in these experiments I meant *the same*. Using different C compilers with the same switches is no guarantee of the same code generation. Therefore, if different compilers were used I do expect to see some variation.
Depending on the problem the variation may or may not be important. For example, in the graph presented in the referred paper a time of 1370.6 is reported under the table heading "Linda" and the table headings "PVM" and "PARFORM". The same number appears in the recent posting for Yale (SCA) Linda. Now, the last time I looked Yale (SCA) Linda had an integrated compiler while PVM and POSYBL were, as libraries, able to use a native compiler and there I expect to see a variation of some kind - even a small one. (The source of this single number becomes clear in the paper and will be revealed in a following paragraph.)

But whatever the case the central point is this: the actual compiler used, the maker and version number, was not reported - it is not good enough to say "all platforms use the C language and C compiler with identical compiler options": this was the same C compiler? The native Sun C compiler? The GNU C compiler? Which?

I am particularly interested in your base time because it, and not any caching artifact, may be the source of your superlinear speed up. I'm not saying that it is so, I'm simply asking the question. However, you might like to consider the following scenario, a reinterpretation of your figures, that is the source for my concern:

Here are the original numbers.

> Procs  POSYBL   SCA linda  PVM      MC-2     The Parform
>   1    1370.6   1370.6     1370.6    --      1370.6  (1.0)*
>   2     737.2    662.2      648.0    --       654.8  (2.1)
>   4     442.6    342.6      328.0   921.4     332.3  (4.1)
>   6     339.3    235.5      219.0   618.5     221.7  (6.2)
>   8     284.6    175.8      168.4   466.7     170.2  (8.0)
>  10     260.2    144.3      143.6   376.6     137.4  (10.0)
>  12     244.7    122.1      116.6   318.2     116.0  (11.8)
>  14     242.7    104.5      100.1   276.2     103.5  (13.5)
>  16     239.5     92.8       90.0   240.0      89.0  (15.4)
>  18     242.6     84.5       97.5   215.9      80.9  (16.9)
>  20     241.6     76.0       85.8   196.6      73.5  (18.7)
>  22               71.5       68.5   182.8      67.5  (20.3)
>  24               66.5       63.6   170.9      62.5  (21.9)
>  26               63.1       60.5   160.9      58.6  (23.4)
>  28               58.5       56.7   151.9      55.8  (24.7)
>  30               55.1       53.5   144.7      53.0  (25.9)
>  32               54.0       54.0   138.5      51.0  (26.9)
>  34               52.4       54.0              50.8  (27.0)
>  36               51.4       52.0              48.5  (28.3)
>  38               51.3       54.0              48.4  (28.3)
>  40               52.9 [?]                     47.2  (29.0)
> * speedup in parentheses
> [?] this copy does not show whether 52.9 belongs under SCA linda or PVM

I assume speedup is computed using the same method as for the graph on page 11 of the "old" paper. There it is stated that the speed up is "with respect to the sequential run on the slowest workstation in our LAN, a Sparcstation1." This is justified later by the comment "In the case of homogeneous partioning [of the problem] the slowest machine in the net is responsible for the speedup. All faster machines idle while waiting at the synchronization points, which are the communication statements. Thus it is reasonable to calculate speedup values with respect to the runtime on the slowest machine."

I understand the point here but I'm not convinced by the reasoning. I would disagree with the paper on this point, arguing that the base time for a parallel speedup measure must be the fastest possible sequential time, not the slowest, regardless of the disparate performance of the relative processing elements. By your reasoning I can sit a Cray on the network (which I'll assume here can do the task in zero time) and still we will see your "speed up", yet if I randomly walk up to your network and sit at the Cray I will not be able to prove your case - however, let us move on.
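For anyone who wants to check the arithmetic, here is a minimal sketch of the speedup measure as I read it: base time divided by the time on p processors, with the base being the single 1370.6 figure. The Python rendering and the names `PARFORM`, `speedup` and `reported` are illustrative assumptions, not anything taken from the paper, and only the first few Parform rows are included.

```python
# First rows of the Parform column from the table above (processors -> reported time).
PARFORM = {1: 1370.6, 2: 654.8, 4: 332.3, 6: 221.7, 8: 170.2, 10: 137.4}


def speedup(times, base):
    """Return {procs: base / time}: speedup relative to a chosen base time."""
    return {procs: base / t for procs, t in times.items()}


# Base time as in the paper: the sequential run on the slowest workstation.
reported = speedup(PARFORM, base=PARFORM[1])

for procs, s in sorted(reported.items()):
    # A ratio above 1.0 per processor is what "superlinear" means here.
    flag = "superlinear" if s > procs else ""
    print(f"{procs:3d} procs: speedup {s:6.2f} {flag}")
```

The ratios come out close to the parenthesized figures in the table, though not identical in every row, and several of them exceed the processor count. Everything in this measure hangs on the choice of `base`.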

Reconsider the figures with the base numbers modified according to the time given for 2 processors times 2. So, this allows that each system, for this problem, speeds up perfectly for 2 processors (which it probably doesn't) but I'll argue it may give me a fair measure of the sequential code generation of each compiler and is equally fair to each system implementation. Not surprisingly we see the variation I expected. To save time permit me to focus on the Yale (SCA) Linda, PVM and PARFORM numbers.

Procs  SCA linda                PVM                      The Parform
  1    1324.4 (1)               1296.0 (1)               1309.6 (1)*
  2     662.2 (2)                648.0 (2)                654.8 (2)     [2.1]**
  4     342.6 (3.86)             328.0 (3.95)             332.3 (3.94)  [4.1]
  6     235.5 (5.62)  [5.82]     219.0 (5.92)  [6.26]     221.7 (5.9)   [6.2]
  8     175.8 (7.53)  [7.8]      168.4 (7.70)  [8.14]     170.2 (7.69)  [8.0]
 10     144.3 (9.18)  [9.5]      143.6 (9.03)  [9.54]     137.4 (9.53)  [10.0]
 12     122.1 (10.8)  [11.23]    116.6 (11.11) [11.75]    116.0 (11.29) [11.8]
 14     104.5 (12.67)            100.1 (13)    [13.6]     103.5 (12.7)  [13.2]
 16      92.8 (14.27)             90.0 (14.4)              89.0 (14.71) [15.4]
 18      84.5 (15.67)             97.5 (13.29)             80.9 (16.19) [16.9]
 20      76.0 (17.43)             85.8 (15.01)             73.5 (17.82) [18.7]
 22      71.5 (18.52)             68.5 (18.92)             67.5 (19.40) [20.3]
 24      66.5 (19.92)             63.6 (20.38)             62.5 (20.95) [21.9]
 26      63.1 (20.99)             60.5 (21.42)             58.6 (22.35) [23.4]
 28      58.5 (22.64) [23.4]      56.7 (22.86) [24.2]      55.8 (23.47) [24.7]
 ...
 * speedup in parentheses
 ** old numbers in square brackets.

These numbers draw a slightly different graph - not greatly different I'll admit but, significantly, there are no superlinear artifacts. Let me make it clear that this is a fiction because we do not have the real base times but I will argue that it is a legitimate interpretation of the numbers. I like these numbers better since there is no superlinear speedup (well you knew there wouldn't be) - and they illustrate the effect of a small change to the respective base times.
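The re-basing above can be sketched in a few lines, under the same stated fiction: each system's base time is taken to be twice its own 2-processor time. The names `TIMES` and `rebased_speedups` are made up for illustration, and only the rows for 2 to 8 processors are included.

```python
# Reported times (processors -> time) for the three systems, first rows only.
TIMES = {
    "SCA linda": {2: 662.2, 4: 342.6, 6: 235.5, 8: 175.8},
    "PVM": {2: 648.0, 4: 328.0, 6: 219.0, 8: 168.4},
    "The Parform": {2: 654.8, 4: 332.3, 6: 221.7, 8: 170.2},
}


def rebased_speedups(times):
    """Assume perfect speedup on 2 processors and derive the base time from it.

    Returns (base, {procs: base / time}). This is the hypothetical base used
    in the table above, not a measured sequential time.
    """
    base = 2 * times[2]
    return base, {procs: base / t for procs, t in times.items()}


for system, times in TIMES.items():
    base, speedups = rebased_speedups(times)
    row = ", ".join(f"{p}: {s:.2f}" for p, s in sorted(speedups.items()))
    print(f"{system:12s} base {base:7.1f} -> {row}")
```

With a per-system base of this kind, none of the ratios in these rows exceeds the processor count, which is the whole point of the exercise.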
Imagine now the result if we had taken the SPARCstation2 base time (564) given in the paper - negative speedup on 2 processors - which may, in fact, be the right information to give a potential user with this application whose network configuration is two SPARC1s and one SPARC2.

I did read the paper. I understand the issues surrounding swapping, I know, as do most people by now, the effective increase in localized memory and cache size can produce superlinear artifacts - I'm just not convinced these numbers illustrate that with the 3800x100 problem size these numbers are for.

These are not trivial points either since they have ramifications in the subsequent analytical conclusions presented in the paper. We certainly should not interpret from these numbers that the PARFORM implementation performs better than the other two. The Linda compiler may still be a version of an old GNU C compiler (it was the last time I -2+ years ago- dug into its guts) and thus its code generation may not be quite as good as the compiler used in the PVM and PARFORM case - we don't know since we don't know the compiler used - it may be enough to remove any disparity shown above.

I am not suggesting that the numbers were in any way faked, I am questioning the use, attribution and validity for the stated cause, of the numbers you have and I welcome clarification. I expect other peer review would do the same.

The stated cause is important here for my comments were aimed at the contention made in the earlier reposting of the numbers that they (the numbers) "heated up the discussion whether the Linda or message passing paradigm are superior" and I am particularly interested in the comparison of parallel programming *models*. The same implication is made in the paper. However, the explicit claim is different.
The explicit claim is (pg18) "The presented results and experiences with our platform show that we can regard a workstation network as a tightly coupled multiprocessor system when exploiting its resources with a system like the PARFORM." with which I doubt anyone in this field would disagree.

But let us be clear, these numbers do not tell us anything about the intrinsic merit of the models involved. They tell us something of the specific implementations, in this case I believe they tell us more about parallel decomposition of this problem: pretty much any interaction model will do, new scheduling techniques are important and applicable to all implementations.

To emphasize: comparing a poor implementation of Linda to any implementation of Message Passing and vice versa reveals no intrinsic knowledge about the relative merit of each model.

And so I must return to my earlier remarks and restate them: for a valid comparison of the models to be made the implementations must be shown to be comparable. The base compiler must be the same; i.e., the sequential code generation must be equivalent. The implementation techniques of the model must be comparable; i.e., process creation, synchronization, interaction implementations must be understood at the lowest level and be shown to be comparable. Furthermore, the problems run should exercise the primitives of the model.

A straight comparison of a program running under Yale (SCA) Linda and PVM is not at all valid in this sense; just as the earlier comparison with POSYBL illustrated (yes, though I rest in silence, I do still read this newsgroup).

The above comments should not be deemed to show favor by me to either message passing or Linda models, anyone knowing my work will know that I have arguments against both for general purpose parallel programming be it in shared or distributed environments. I have solutions but *that* is for another time.

Peace and balance to you all in this coming New Year.

--
Steven Ericsson Zenith
Disclaimer: Opinions expressed are my own and not necessarily those of KAI.