NetNews Usenet Archive 1992 #31

Text File | 1992-12-31 | 4.3 KB | 91 lines

Newsgroups: comp.os.vms
Path: sparky!uunet!zaphod.mps.ohio-state.edu!usc!cs.utexas.edu!geraldo.cc.utexas.edu!slcs.slb.com!BRYDON@128.58.42.3
From: brydon@asl.slb.com (Harvey Brydon (918)250-4312)
Subject: Re: Mount verification timed out
Message-ID: <1992Dec31.115043.1445@slcs.slb.com>
Sender: news@slcs.slb.com (News Administrator)
Nntp-Posting-Host: 129.87.186.2
Reply-To: brydon@dsn.SINet.slb.com
Organization: Schlumberger/Anadrill Sugar Land, TX
References: <1992Dec29.152302.1716@das.harvard.edu>,<30DEC199208223472@spades.aces.com>
Date: Thu, 31 Dec 92 11:50:43 GMT
Lines: 77

In article <30DEC199208223472@spades.aces.com>, system@spades.aces.com (SYSTEM MANAGER) writes:
>In article <1992Dec29.152302.1716@das.harvard.edu>, chen@speed.uucp (Lilei Chen) writes...
>#In our VAX/VMS-cluster a bunch of disks are cross mounted. Sometimes when
>#a node goes down, its disks are marked mount verification timed out. I
>#haven't found a way to get those disks remounted without rebooting the
>#nodes. I am wondering if someone on the net has a solution for that problem.
>#Thanks.
>
>       Short:  Increase SYSGEN parameter MVTIMEOUT to the time
>               it takes one of your nodes to crash and reboot
>               plus a minute

I disagree with this advice. I think you mean the time it takes the slowest
node in the cluster to crash and get to the reboot code that re-mounts the
disks (plus a minute), but I disagree with that too. I set MVTIMEOUT to the
max value on my clusters. More below.

>       Long:   Mount verification is what VMS does to a device
>               it isn't sure is responding. After a certain
>               period of time where the device does not complete
>               mount verification, mount verification timeout
>               occurs. (This timer is controlled by the parameter
>               mentioned above).
>
>               If you set the timer short, then as soon as a disk
>               times out, all IOs to it are returned, no processes
>               get to issue new IOs to the disk, and nobody hangs.
>               The bad side of this is that even if the node serving
>               this device comes back, the device is inaccessible.
>
>               If you set the timer long, then it will survive across
>               reboots (the device will) but all processes with pending
>               IOs will hang until timeout.

The above 'objectionable' behaviour is not always bad. If it is a backup
writing to the disk, for example, it will indeed hang if the disk goes
offline, but (for my systems) this is preferable to getting the I/Os
returned and aborting the backup of the disk. Other situations apply.

Generally, the cleanup for a disk in mount verification is bad enough that
I don't recommend putting anything on satellite disks that would be used by
anything but that satellite system (except things like backup). If you ever
put things like cluster-wide installed images, user directories, etc. on
them, a mount verification timeout can sometimes require a reboot of
numerous cluster members, or even the entire cluster.

Mount verification timeout always requires manual intervention to fix. I
haven't found a good way of automating the recovery in a command procedure.
A disk in mount verification (with no timeout) fixes itself when the system
reboots again.

Also, you assume only the case of a system crashing and rebooting. The MV
timeout would also occur if (say) you shut down the system manually, or it
hangs, or the network is disrupted. I recommend the highest possible value
for MVTIMEOUT (about 17 hours?). I wish DEC allowed an infinite value for
it. Most of my massive cluster reboots occur Monday morning because of
this...

As George Bush would say:
        "Mount verification, GOOD"
        "Mount verification timeout, BAD"

I participated in an extended discussion on this on comp.os.vms about a year
ago with a few more details.

By the way, take note that MV timeout only occurs on a given node when it
'discovers' that a given disk is offline. As far as I know, this only
happens when you try to do I/O to the disk.
And every node in a cluster has its own opinion of the state of the disk.
A disk can be timed out on one node and not on another.

> I would try and minimize the crashing of your other nodes...

Agreed.
_______________________________________________________________
Harvey Brydon                | Internet: brydon@dsn.SINet.slb.com
Dowell Schlumberger          | P.O.T.S.: (918)250-4312
        Sorry, but my karma just ran over your dogma!
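
The short-versus-long trade-off argued in the post can be sketched as a toy model. This is not VMS code and makes no claim about VMS internals: `NodeView` and `attempt_io` are hypothetical names, time is in plain seconds, the timeout is only checked when I/O is attempted, and 64000 is an assumed "very long" value chosen to roughly match the "about 17 hours" guessed at in the post. It only illustrates three points from the text: each node keeps its own view of the disk, the timer starts when that node first tries I/O to an offline disk, and a timed-out disk stays unusable on that node even after it comes back.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeView:
    """One node's private view of a served disk (hypothetical toy model)."""
    mvtimeout: int                       # seconds; stands in for MVTIMEOUT
    mv_started_at: Optional[int] = None  # when this node noticed the disk was gone
    timed_out: bool = False              # sticky until someone intervenes by hand

    def attempt_io(self, now: int, disk_online: bool) -> str:
        """Return 'ok', 'stalled' or 'error' for an I/O attempted at time `now`."""
        if self.timed_out:
            # After a timeout the disk stays inaccessible on this node,
            # even if the serving node has come back.
            return "error"
        if disk_online:
            self.mv_started_at = None    # mount verification completed
            return "ok"
        if self.mv_started_at is None:
            # The node only 'discovers' the outage when it tries I/O.
            self.mv_started_at = now
        if now - self.mv_started_at >= self.mvtimeout:
            self.timed_out = True
            return "error"
        return "stalled"                 # I/O hangs while verification continues


def outage_demo(mvtimeout: int) -> list:
    """Disk is offline from t=0 to t=800; the node tries I/O at 0, 700 and 800."""
    node = NodeView(mvtimeout)
    return [node.attempt_io(0, False),
            node.attempt_io(700, False),
            node.attempt_io(800, True)]
```

With a short timer, `outage_demo(600)` ends in a sticky error; with the assumed long value, `outage_demo(64000)` stalls during the outage and then recovers by itself. A second `NodeView` that never touches the disk during the outage never times out at all, which mirrors the remark that a disk can be timed out on one node and not on another.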