- Newsgroups: comp.unix.wizards
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!cs.utexas.edu!newsfeed.rice.edu!exlogcorp!mcdowell
- From: mcdowell@exlog.com (Steve McDowell)
- Subject: Re: Changing the owner of a process
- Message-ID: <1992Nov23.202603.23026@exlog.com>
- Keywords: process ownership
- Organization: EXLOG, Inc.
- References: <1992Nov17.142837.21252@dale.ksc.nasa.gov> <1992Nov19.220759.1846@exlog.com> <1992Nov23.180757.20627@nuchat.sccsi.com>
- Date: Mon, 23 Nov 92 20:26:03 GMT
- Lines: 75
-
- In message <1992Nov23.180757.20627@nuchat.sccsi.com> steve@nuchat.sccsi.com (Steve Nuchia) writes:
- > In article <1992Nov19.220759.1846@exlog.com> mcdowell@exlogcorp.exlog.com (Steve McDowell) writes:
- >>Why is it going to panic over an inconsistent process count? That's simply
- >>....it should simply alert the operator and re-count things.
- >
- > Three points:
-
- > I'd prefer to have the CSRG's remaining time spent
- > on improving fundamental algorithms.
-
- Actually, so would I. A big difference here is that I'd like to see the
- state of "fundamental algorithms" expanded to include issues such as fault
- tolerance and error recovery. They're too often ignored in practice.
-
- > 2: I may be old-fashioned, but I prefer to have a system panic
- > when it detects a "can't happen" bug. That means something
- > has gone wrong. Under those circumstances, why would you want
- > to trust a piece of recovery code that hasn't been tested in
- > living memory?
-
- First of all, I really don't want a production system with *any*
- code that hasn't been tested "in living memory". Test your code
- before you ship it, the guys over in comp.software-eng will tell
- you how...
-
- What I've said all along is that I'd like the kernel to try
- to resolve the error safely before panicking. Whether that requires
- "voting" by having multiple routines do recounts in different
- ways and panicking if they don't match up, or whether it requires
- some other technique from the traditional realtime/fault-tolerant
- world, I don't know. I think this is definitely an area for study.
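
As a rough illustration of the "voting" idea, here is a minimal user-space sketch in C. Everything in it is an assumption made for the example: the `proc_slot` structure, the table size `NPROC`, and the function names are hypothetical and do not come from any BSD kernel source. Two routines recount the live processes in different ways (a flat scan of the table and a walk of a linked list of live slots), and the cached count is repaired only when both agree. On disagreement the function reports failure, which is where a real kernel would panic.

```c
#include <stddef.h>

#define NPROC 8            /* hypothetical process-table size */

struct proc_slot {         /* hypothetical, not a real kernel structure */
    int in_use;            /* nonzero if the slot holds a live process */
    int next;              /* index of the next live slot, or -1 at list end */
};

enum recount_result { RECOUNT_OK, RECOUNT_DISAGREE };

/* Method A: scan every slot and count the ones marked in use. */
static int count_by_scan(const struct proc_slot *tab, size_t n)
{
    int count = 0;
    for (size_t i = 0; i < n; i++)
        if (tab[i].in_use)
            count++;
    return count;
}

/* Method B: follow the linked list of live slots starting at `head`.
 * Returns -1 if a link is out of range or the list is longer than the
 * table, which can only happen if the list is corrupt (e.g. cyclic). */
static int count_by_list(const struct proc_slot *tab, size_t n, int head)
{
    int count = 0;
    for (int i = head; i != -1; i = tab[i].next) {
        if (i < 0 || (size_t)i >= n || count >= (int)n)
            return -1;
        count++;
    }
    return count;
}

/* Repair *cached only when both independent recounts agree.
 * On disagreement, leave it untouched so the caller can panic. */
static enum recount_result
recount_with_vote(const struct proc_slot *tab, size_t n, int head, int *cached)
{
    int a = count_by_scan(tab, n);
    int b = count_by_list(tab, n, head);

    if (b < 0 || a != b)
        return RECOUNT_DISAGREE;
    *cached = a;
    return RECOUNT_OK;
}
```

The point of using two different traversals is that a single corrupted field is unlikely to fool both in the same way; a slot marked live but missing from the list, or a list that loops, shows up as a disagreement rather than as a silently wrong count.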
-
- The realtime world has been doing this kind of thing since the 1960's,
- but I keep hearing (in email responses, anyway) that "it can't be
- done" or that it "just isn't safe". It can be made safe. It can
- be made to work. It just takes an investment of committed resources.
-
- > 3: What is the operator going to do about it, anyway? If one of
- > these counts gets out of whack, somebody who can fix the code needs
- > to know about it, fast.
-
- The term "operator" was meant as a generic term indicating the
- person running the computer, whether that person be a developer, field engineer,
- or backup clerk. And usually, somebody who can "fix the code...fast" isn't
- available in the field. I've said before that in the lab, I'm all for doing
- whatever it takes to trace the bugs.
-
- > Something caused the fault, and there is
- > literally no telling what else may be broken until the cause is found.
-
- Certainly there are some faults that point more towards system corruption
- than others -- panic your way out of those. For those that are recoverable,
- and the key word is *recoverable*, try to recover. Put in a routine to
- recompute a consistent state. Make that routine pass stringent tests to
- give an acceptable degree of certainty before recovering from a
- particular error. If there's uncertainty, then abort. I'm not saying
- to run with an inconsistent state, I'm saying to build in the
- intelligence to find a consistent state.
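
The recover-or-abort pattern described above might look something like the following C sketch. It is only an illustration under stated assumptions: the `ptab` structure, the sizes `NPROC` and `NUSER`, and in particular the `MAX_DRIFT` acceptance rule are all hypothetical choices made for the example, not anything taken from a real kernel. The routine rebuilds the counts in scratch space, refuses to continue if the table itself looks damaged or if the cached total is further off than a single missed update would explain, and commits the new state only after every check has passed.

```c
#include <string.h>

#define NPROC     8   /* hypothetical process-table size */
#define NUSER     4   /* hypothetical number of user ids */
#define MAX_DRIFT 1   /* assumed policy: tolerate one missed update */

struct ptab {                 /* hypothetical, simplified kernel state */
    int uid[NPROC];           /* owner of each slot, or -1 if free */
    int per_user[NUSER];      /* cached process count per user */
    int total;                /* cached total process count */
};

enum recover_result { RECOVERED, MUST_PANIC };

static enum recover_result recover_counts(struct ptab *pt)
{
    int per_user[NUSER] = {0};
    int total = 0;

    /* Rebuild the counts in scratch space; the live copy stays untouched. */
    for (int i = 0; i < NPROC; i++) {
        int uid = pt->uid[i];
        if (uid == -1)
            continue;                 /* free slot */
        if (uid < 0 || uid >= NUSER)
            return MUST_PANIC;        /* the table itself looks corrupt */
        per_user[uid]++;
        total++;
    }

    /* A large gap between cached and recomputed totals suggests wider
     * damage than a single bookkeeping slip, so give up in that case. */
    int drift = total - pt->total;
    if (drift < -MAX_DRIFT || drift > MAX_DRIFT)
        return MUST_PANIC;

    /* Every check passed: commit the recomputed state. */
    memcpy(pt->per_user, per_user, sizeof per_user);
    pt->total = total;
    return RECOVERED;
}
```

Because nothing is written back until the end, an aborted recovery leaves the state exactly as it was found, which keeps the evidence intact for whoever has to trace the bug after the panic.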
-
- > Of course, if one is building a system for binary-only distribution
- > or attempting to provide for non-stop operation, other considerations
- > apply. BSD Unix does not have those design goals.
-
- Most systems outside of development labs are binary-only. Most commercial
- users would definitely appreciate non-stop operation. I think it's the
- responsibility of those building systems to strive towards the goal of
- non-stop operation. Error-less code is never going to happen. Error-tolerance
- can.
-
- Just my humble opinion.
- --
- Steve McDowell . . . . o o o o o Opinions are
- Exlog, Inc. _____ o mine, not my
- mcdowell@exlog.com _____==== ]OO|_n_n__][. employer's.
- [_________]_|__|________)<
-