- Path: sparky!uunet!olivea!hal.com!darkstar.UCSC.EDU!osr
- From: grichard@cis.ohio-state.edu (Golden Richard)
- Newsgroups: comp.os.research
- Subject: Summary of responses to Unix checkpoint request
- Message-ID: <1jmq5sINN87g@darkstar.UCSC.EDU>
- Date: 21 Jan 93 18:31:24 GMT
- Organization: The Ohio State University Dept. of Computer and Info. Science
- Lines: 130
- Approved: comp-os-research@ftp.cse.ucsc.edu
- NNTP-Posting-Host: ftp.cse.ucsc.edu
- Originator: osr@ftp
-
-
-
- The response to my request for Unix checkpoint/restore packages was
- overwhelming, with the great majority of the responses pointing to
- Condor, available by ftp from ftp.cs.wisc.edu. Condor works entirely
- at user level [no kernel modifications required] but doesn't currently
- support interprocess communication, signals, or fork(). Definitely
- worth a look. With some modifications, Condor may be what I need.
-
-
- Some of the non-Condor responses follow (apologies in advance to the
- authors for the crude editing):
-
- ****************
- Bennet S. Yee
-
- I have a mostly portable implementation of just that which I wrote
- around 6 years ago. When you invoke the checkpoint procedure, it
- saves the state to a file; when you start up a second process with the
- same program (but with different arguments) which calls the restore
- procedure, it reads the old state from the file. What is restored:
- data, stack, and all registers -- the restore procedure causes the
- restoring process to change its state so that the code appears to be
- returning from the save routine but with a different value, much as
- longjmp/setjmp does. Even the arglist and the environment are changed
- back to that of the original invocation. Unix process state that is
- NOT restored: I/O descriptors and their state, current working
- directory, resource limits.
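
The save-then-resume pattern described above can be illustrated at application level. This is only a rough sketch under stated assumptions: it is not the save_world package, and it does not capture the stack or registers. It serializes an explicit state dictionary with pickle instead. The names `save_world`, `restore_world`, and `run` are hypothetical and chosen for illustration only. As in the description above, open descriptors and the working directory are not part of what gets saved.

```python
import os
import pickle
import tempfile


def save_world(path, state):
    """Write `state` to `path` atomically (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first, then rename it into place, so a
    # crash in the middle of a write never leaves a truncated checkpoint.
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)


def restore_world(path):
    """Read back the state written by save_world (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, limit, stop_at=None):
    """Sum 0..limit-1, checkpointing after each step.

    A second invocation with the same `path` resumes from the saved state.
    `stop_at` simulates the process dying at that step.
    """
    if os.path.exists(path):
        state = restore_world(path)
    else:
        state = {"i": 0, "total": 0}
    while state["i"] < limit:
        if stop_at is not None and state["i"] == stop_at:
            return None  # simulated crash; the checkpoint file remains
        state["total"] += state["i"]
        state["i"] += 1
        save_world(path, state)
    return state["total"]
```

Calling `run(path, 10, stop_at=4)` and then `run(path, 10)` with the same path resumes at step 4 rather than starting over. A transparent checkpointer of the kind described above avoids the explicit state dictionary by saving the data segment, stack, and registers directly.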
-
- You should be able to anonymous ftp this package from my machine,
- play.trust.cs.cmu.edu, from the directory /usr/bsy/pub/. The package
- is in save_world.shar.Z, and it is known to work for Pmaxen, Sun4's,
- Sun3's, IBM RTs, and Vaxen. Porting it to a new architecture should
- be relatively simple -- look at the README file.
-
- Lemme know what you do with it.
-
-
- ********************
- charlie shub
-
- i'd look at jefferson's Time Warp Operating System or contact Jade
- Simulations in Calgary who have implemented such a beastie.
-
-
- **********************
- Jian Xu
-
- Stuart Feldman and Channing Brown's paper "IGOR: a system for program
- debugging via reversible execution" in the 1st Workshop on Parallel
- and Distributed Debugging (1988) described such a system (the proceedings
- have also been published in SIGPLAN). You may also find the following
- paper interesting.
-
- @INPROCEEDINGS{kaili:concurrentckp,
- AUTHOR = "Kai Li and J. F. Naughton and J. S. Plank",
- TITLE = "Real-time, concurrent checkpointing for parallel
- programs",
- BOOKTITLE = "Proc. 2nd ACM SIGPLAN Symp. on Principles and Practice
- of Parallel Programming",
- PAGES = "79--88",
- MONTH = mar,
- YEAR = 1990
- }
-
-
- I don't know whether there is any production and/or free tool for
- doing checkpointing, and I am interested in learning if you have any pointers.
-
-
-
- *************
- Andrew Mullhaupt
-
-
- In modern Unices, a process is a file (often in /proc) and you can
- completely record the state of the process via this file.
-
- If you're just looking to checkpoint a process it's usually easiest
- to use memory mapped files for all your data as opposed to using
- malloc/free. You can easily write your own file backed malloc/free
- instead. Then the whole state of your process is checkpointed just
- by doing a msync, which gets all data onto disk. I've used this for
- big jobs which run a long time.
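
The memory-mapped approach described above might look roughly like the following. This is a minimal sketch, assuming the whole job state fits in one fixed-size record. The names `RECORD`, `open_state`, and `checkpointed_sum` are hypothetical. Python's `mmap.flush()` is used here because on Unix it issues the msync call mentioned above.

```python
import mmap
import os
import struct

# Assumed layout for illustration: two signed 64-bit integers,
# (next index, running total).
RECORD = struct.Struct("<qq")


def open_state(path):
    """Map a file-backed record into memory, creating it zero-filled if new."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        if os.fstat(fd).st_size < RECORD.size:
            os.ftruncate(fd, RECORD.size)  # new bytes read back as zeros
        # The default mapping is shared and read/write, so stores reach the file.
        return mmap.mmap(fd, RECORD.size)
    finally:
        os.close(fd)  # the mapping keeps its own duplicate of the descriptor


def checkpointed_sum(path, limit, stop_at=None):
    """Sum 0..limit-1 with all state held in the mapped file.

    `stop_at` simulates the job being killed at that step; a later call
    with the same `path` picks up from the last flushed record.
    """
    state = open_state(path)
    try:
        i, total = RECORD.unpack_from(state, 0)
        while i < limit:
            if stop_at is not None and i == stop_at:
                return None  # simulated kill
            total += i
            i += 1
            RECORD.pack_into(state, 0, i, total)
            state.flush()  # msync on Unix: push the mapped pages to disk
        return total
    finally:
        state.close()
```

Flushing after every step keeps the example simple; a long-running job would more likely flush at intervals and accept losing the work done since the last flush.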
-
-
- *************
- Greg Price
-
- I think that UNICOS (Cray stuff?) has this facility... I think I also saw
- a technical paper.... Look for the Winter '88 USENIX Technical Conference
- in San Diego paper called "Job and Process Recovery in a UNIX-based
- Operating System" by Brent Kingsbury and John Kline (brent@yafs.cray.com &
- jtk@hall.cray.com). I haven't read it so I don't know how good or bad it
- is. I seem to remember some people doing some work on this stuff a year or
- so back... it was in usenet news.
-
-
- *************
- James Pinakis
-
-
- There's a public domain package available called "pmckpt" (poor man's
- checkpointer). I can't remember exactly where I got it from but an
- archie search on pmckpt should give some results. Here's the readme
- file...
-
- [readme file deleted because of length]
-
-
- *************
- Pete Ware
-
- Two options come to mind: "condor" which has been installed at OSU and
- will work for many programs merely by relinking and "isis" from
- Cornell which replays IPC messages and has the notion of
- checkpointing. ...
-
-
- *************
- John Bazik
-
- You should look at condor from Univ of Wisconsin. At Brown, we are working
- on our own distributed processing system that we call Quahog. In it, we
- provide a simple checkpoint/restart facility, that does work, but not
- transparently, as condor's does. We have a transparent facility on our
- list of todos.
- --
- Golden Richard III OSU Dept. of Computer and Information Science
- grichard@cis.ohio-state.edu (614) 292-0056
-