NetNews Usenet Archive 1992 #27

Text File | 1992-11-23 | 4.1 KB | 87 lines

Newsgroups: comp.unix.wizards
Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!newsfeed.rice.edu!exlogcorp!mcdowell
From: mcdowell@exlog.com (Steve McDowell)
Subject: Re: Changing the owner of a process
Message-ID: <1992Nov23.202603.23026@exlog.com>
Keywords: process ownership
Organization: EXLOG, Inc.
References: <1992Nov17.142837.21252@dale.ksc.nasa.gov> <1992Nov19.220759.1846@exlog.com> <1992Nov23.180757.20627@nuchat.sccsi.com>
Date: Mon, 23 Nov 92 20:26:03 GMT
Lines: 75

In message <1992Nov23.180757.20627@nuchat.sccsi.com> steve@nuchat.sccsi.com (Steve Nuchia) writes:
> In article <1992Nov19.220759.1846@exlog.com> mcdowell@exlogcorp.exlog.com (Steve McDowell) writes:
>>Why is it going to panic over an inconsistent process count?  That's simply
>>....it should simply alert the operator and re-count things.
>
> Three points:
> I'd prefer to have the CSRG's remaining time spent
> on improving fundamental algorithms.

Actually, so would I. A big difference here is that I'd like to see the
definition of "fundamental algorithms" expanded to include issues such
as fault tolerance and error recovery. These are too often ignored in
practice.

> 2: I may be old-fashioned, but I prefer to have a system panic
> when it detects a "can't happen" bug.  That means something
> has gone wrong.  Under those circumstances, why would you want
> to trust a piece of recovery code that hasn't been tested in
> living memory?

First of all, I really don't want a production system with *any* code
that hasn't been tested "in living memory". Test your code before you
ship it, the guys over in comp.software-eng will tell you how...

What I've said all along is that I'd like for the kernel to try to
safely resolve the error before panic'ing. Whether that requires
"voting" by having multiple routines doing recounts in different ways
and panic'ing if they don't match up, or whether it requires some other
technique from the traditional realtime/fault-tolerant world, I don't
know.

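To make the "voting" idea concrete, here is a minimal sketch in C. It is
an illustration only, not code from any BSD kernel: the process table
layout and every name in it (proc_slot, proc_table, count_by_scan,
count_by_walk, recover_proc_count) are hypothetical. Two routines
recount the processes in different ways, one by scanning every slot and
one by walking the linked list of used slots. The cached count is
repaired only when both recounts agree; any disagreement or malformed
link is reported as unrecoverable, which is where the caller would
panic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified process table (an assumption for this sketch). */
enum { SLOT_FREE = 0, SLOT_USED = 1 };

struct proc_slot {
    int state;              /* SLOT_FREE or SLOT_USED */
    int next;               /* index of next used slot, -1 at end of list */
};

struct proc_table {
    struct proc_slot *slots;
    size_t nslots;
    int head;               /* index of first used slot, -1 if none */
    size_t nprocs;          /* cached count that may have gone bad */
};

enum recover_result { COUNT_OK, COUNT_REPAIRED, COUNT_UNRECOVERABLE };

/* Recount #1: look at every slot and count the ones marked used. */
static size_t count_by_scan(const struct proc_table *t)
{
    size_t i, n = 0;

    for (i = 0; i < t->nslots; i++)
        if (t->slots[i].state == SLOT_USED)
            n++;
    return n;
}

/* Recount #2: walk the used list. Returns 0 if the list is malformed. */
static int count_by_walk(const struct proc_table *t, size_t *out)
{
    size_t n = 0;
    int i = t->head;

    while (i != -1) {
        if (i < 0 || (size_t)i >= t->nslots)
            return 0;       /* link points outside the table */
        if (t->slots[i].state != SLOT_USED)
            return 0;       /* list references a free slot */
        if (++n > t->nslots)
            return 0;       /* more nodes than slots: must be a cycle */
        i = t->slots[i].next;
    }
    *out = n;
    return 1;
}

/* Repair the cached count only if both independent recounts agree. */
enum recover_result recover_proc_count(struct proc_table *t)
{
    size_t by_scan, by_walk;

    assert(t != NULL && t->slots != NULL);

    by_scan = count_by_scan(t);
    if (!count_by_walk(t, &by_walk) || by_scan != by_walk)
        return COUNT_UNRECOVERABLE;     /* uncertain: caller should panic */
    if (t->nprocs == by_scan)
        return COUNT_OK;
    t->nprocs = by_scan;                /* both votes agree: repair */
    return COUNT_REPAIRED;
}
```

A caller would treat COUNT_REPAIRED as "log it, alert the operator, and
keep running" and COUNT_UNRECOVERABLE as grounds for a panic. Two voters
is the bare minimum; a real design would have to argue that the recounts
fail independently.
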
I think this is definitely an area for study. The realtime world has
been doing this kind of thing since the 1960's, but I keep hearing (in
email responses, anyway) that "it can't be done" or that it "just isn't
safe". It can be made safe. It can be made to work. It just takes an
investment of committed resources.

> 3: What is the operator going to do about it, anyway?  If one of
> these counts gets out of whack, somebody who can fix the code needs
> to know about it, fast.

The term "operator" was meant as a generic term indicating the person
running the computer, whether that person be a developer, field
engineer, or backup clerk. And usually, somebody who can "fix the
code...fast" isn't available in the field. I've said before that in the
lab, I'm all for doing whatever it takes to trace the bugs.

> Something caused the fault, and there is
> literally no telling what else may be broken until the cause is found.

Certainly there are some faults that point more towards system
corruption than others -- panic your way out of those. For those that
are recoverable, and the key word is *recoverable*, then try. Put in a
routine to recompute a consistent state. Make that routine pass
stringent tests to give an acceptable degree of certainty before
recovering from a particular error. If there's uncertainty, then abort.
I'm not saying to run with an inconsistent state, I'm saying to build in
the intelligence to find a consistent state.

> Of course, if one is building a system for binary-only distribution
> or attempting to provide for non-stop operation, other considerations
> apply.  BSD Unix does not have those design goals.

Most systems outside of development labs are binary-only. Most
commercial users would definitely appreciate non-stop operation. I think
it's the responsibility of those building systems to strive towards the
goal of non-stop operation. Error-less code is never going to happen.
Error-tolerance can.

Just my humble opinion.

--
Steve McDowell                . . . .
                            o o o o o           Opinions are
Exlog, Inc.          _____      o               mine, not my
mcdowell@exlog.com   _____====  ]OO|_n_n__][.   employers..
                    [_________]_|__|________)<