- Newsgroups: comp.os.vms
- Path: sparky!uunet!zaphod.mps.ohio-state.edu!usc!cs.utexas.edu!geraldo.cc.utexas.edu!slcs.slb.com!BRYDON@128.58.42.3
- From: brydon@asl.slb.com (Harvey Brydon (918)250-4312)
- Subject: Re: Mount verification timed out
- Message-ID: <1992Dec31.115043.1445@slcs.slb.com>
- Sender: news@slcs.slb.com (News Administrator)
- Nntp-Posting-Host: 129.87.186.2
- Reply-To: brydon@dsn.SINet.slb.com
- Organization: Schlumberger/Anadrill Sugar Land, TX
- References: <1992Dec29.152302.1716@das.harvard.edu>,<30DEC199208223472@spades.aces.com>
- Date: Thu, 31 Dec 92 11:50:43 GMT
- Lines: 77
-
- In article <30DEC199208223472@spades.aces.com>, system@spades.aces.com (SYSTEM
- MANAGER) writes:
- >In article <1992Dec29.152302.1716@das.harvard.edu>, chen@speed.uucp (Lilei Chen) writes...
- >#In our VAX/VMS-cluster a bunch of disks are cross mounted. Sometimes when
- >#a node goes down, its disks are marked mount verification timed out. I
- >#haven't found a way to get those disks remounted without rebooting the
- >#nodes. I am wondering if someone on the net has a solution for that problem.
- >#Thanks.
- >
- > Short: Increase SYSGEN parameter MVTIMEOUT to the time
- > it takes one of your nodes to crash and reboot
- > plus a minute
-
- I disagree with this advice. I think you mean the time it takes the slowest
- node in the cluster to crash and get to the reboot code that re-mounts the
- disks (plus a minute) but I disagree with that too. I set MVTIMEOUT to the
- max value on my clusters. More below.
-
- > Long: Mount verification is what VMS does to a device
- > it isn't sure is responding. After a certain
- > period of time where the device does not complete
- > mount verification, mount verification timeout
- > occurs. (This timer is controlled by the parameter
- > mentioned above).
- >
- > If you set the timer short, then as soon as a disk
- > times out, all IOs to it are returned, no processes
- > get to issue new IOs to the disk, and nobody hangs.
- > The bad side of this is that even if the node serving
- > this device comes back, the device is inaccessible.
- >
- > If you set the timer long, then it will survive across
- > reboots (the device will) but all processes with pending
- > IOs will hang until timeout.
-
- The above 'objectionable' behaviour is not always bad. If it is a backup
- writing to the disk, for example, it will indeed hang if the disk goes
- offline, but (for my systems) this is preferable to getting the I/Os
- returned and aborting the backup of the disk. Other situations apply.
- Generally, the cleanup for a disk in mount verification is bad enough that I
- don't recommend putting anything on satellite disks that would be used by
- anything but that satellite system (except things like backup). If you ever
- put things like cluster-wide installed images, user directories, etc. there,
- a mount verification timeout can sometimes require a reboot of numerous
- cluster members, or even the entire cluster.
-
- Mount verification timeout always requires manual intervention to fix. I
- haven't found a good way to automate the recovery with a command procedure.
- A disk that is still in mount verification (not yet timed out) fixes itself
- when the system serving it comes back up.
-
- Also, you are only considering the case of a system crashing and rebooting. The MV
- timeout would also occur if (say) you shut down the system manually, or it
- hangs, or the network is disrupted. I recommend the highest possible value
- for MVTIMEOUT (about 17 hours?). I wish DEC allowed an infinite value for it.
- Most of my massive cluster reboots occur Monday morning because of this...
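
A rough check of the arithmetic behind the "about 17 hours" figure and the
Monday-morning reboots. This is only a sketch: the 64000-second ceiling is an
assumption (confirm the real maximum with SYSGEN SHOW MVTIMEOUT on your own
system), and the Friday-to-Monday outage length is a made-up example.

```python
# Assumed upper limit for MVTIMEOUT, in seconds. Not taken from the post;
# check it against what SYSGEN reports on your system.
ASSUMED_MVTIMEOUT_MAX = 64000


def timeout_hours(mvtimeout_seconds):
    """Convert an MVTIMEOUT value in seconds to hours."""
    return mvtimeout_seconds / 3600


def survives_outage(outage_hours, mvtimeout_seconds):
    """True if the serving node is back before mount verification times out."""
    return outage_hours < timeout_hours(mvtimeout_seconds)


# Hypothetical weekend outage: node dies Friday 18:00, fixed Monday 08:00.
WEEKEND_OUTAGE_HOURS = 6 + 24 + 24 + 8

max_hours = timeout_hours(ASSUMED_MVTIMEOUT_MAX)
weekend_ok = survives_outage(WEEKEND_OUTAGE_HOURS, ASSUMED_MVTIMEOUT_MAX)
overnight_ok = survives_outage(12, ASSUMED_MVTIMEOUT_MAX)
```

Under those assumptions the largest setting rides out an overnight outage but
not a whole weekend, which is why an unbounded value would be welcome.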
-
- As George Bush would say:
- "Mount verification, GOOD"
- "Mount verification timeout, BAD"
-
- I participated in an extended discussion on this on comp.os.vms about a year
- ago with a few more details.
-
- By the way, take note that MV timeout only occurs on a given node when it
- 'discovers' that a given disk is offline. As far as I know, this only happens
- when you try to do I/O to the disk. And every node in a cluster has its own
- opinion of the state of the disk. A disk can be timed out on one node and not
- on another.
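
The per-node behaviour described above can be sketched as a toy state machine.
This is not how VMS implements it; the class, method names, and times are all
hypothetical, and the only things taken from the discussion are that a node
enters mount verification when it tries I/O to a dead disk, that a timeout
sticks until someone intervenes, and that each node keeps its own view.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class NodeView:
    """One node's private opinion of one disk (toy model only)."""
    mvtimeout: int                      # seconds; stands in for MVTIMEOUT
    state: str = "mounted"              # mounted | verifying | timed_out
    mv_started: Optional[int] = None    # when this node began verifying

    def tick(self, now, disk_online):
        """Advance the verification timer; does nothing unless verifying."""
        if self.state != "verifying":
            return
        if disk_online:
            self.state, self.mv_started = "mounted", None
        elif now - self.mv_started >= self.mvtimeout:
            self.state = "timed_out"    # stays this way in the model

    def do_io(self, now, disk_online):
        """Attempt I/O; this is the only way a node notices a dead disk."""
        if self.state == "mounted" and not disk_online:
            self.state, self.mv_started = "verifying", now
        self.tick(now, disk_online)
        return {"mounted": "ok", "verifying": "stalled",
                "timed_out": "error"}[self.state]


def simulate(mvtimeout):
    """Node A touches the disk during the outage, node B never does."""
    a, b = NodeView(mvtimeout), NodeView(mvtimeout)
    a.do_io(now=10, disk_online=False)
    for node in (a, b):
        node.tick(now=700, disk_online=False)
    for node in (a, b):
        node.tick(now=800, disk_online=True)   # serving node is back
    b.do_io(now=810, disk_online=True)
    return {"A": a.state, "B": b.state}
```

With a short timer, simulate(600) leaves node A timed out while node B never
noticed anything. With a long one, simulate(64000) has both nodes mounted
again once the serving node returns.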
-
- > I would try and minimize the crashing of your other nodes...
-
- Agreed.
- _______________________________________________________________
- Harvey Brydon | Internet: brydon@dsn.SINet.slb.com
- Dowell Schlumberger | P.O.T.S.: (918)250-4312
- Sorry, but my karma just ran over your dogma!
-