Thanks to Tony Sanders
Internet connection line failures (dealing with telco and provider)
Routing problems
General network problems (site x.y.z is down and the user complains to you)
Catastrophic Disk failure (make sure you have backups)
Users deleting files and then wanting them back (backups again)
Modem problems of all sorts (parity mismatch, hung modems, flow control,
not disconnecting properly, not connecting properly)
System crashes and hangs
Configuring ftp, telnet, shell accounts, nntp, www, smtp/sendmail, nfs
Bug tracking (a potentially big problem)
Billing
Ideally, there should be someone around 24 hours a day to make sure the
system is put back up when it dies. A reasonable compromise for companies
that aren't large enough to do this is to be sure someone's on the system
during peak usage hours, to make sure the system is reset when it goes down.
(Information on devices to reset the system automatically upon a crash
should be given here).
You may be able to prevent many system crashes - or at least minimize
their effect - by purchasing an Uninterruptible Power Supply (UPS) and
a mechanism to automatically shut the system down gracefully upon power
failure. Unfortunately, this is another issue I'm not that familiar with;
expertise from those who know would be appreciated.
Walter Vose Jeffries writes:
"What we did with our systems here is bought cheap ($500) but
reasonably good UPSs and then replaced their batteries with deep cycle
marine/truck batteries. This gives us 8 to 16 hours of run time on
the battery (which then takes 4 days to recharge completely). This
works well and is cheap; the batteries cost $60 to $100 or so.
That's much less than the cost of a big UPS and by having several that
each give long power protection we are even better off than with one
central unit. In seven years, our longest outage was 7 hours with
three others at about 4 hours and most 5 minutes to one hour. This
way, we don't worry about having to shut down since we know we'll
outlast even the worst power outages we've had in seven years. (Of
course we still program in graceful shut down - but better to never
have it happen)."
I would have loved to have had such a thing during our 1994
earthquake, where power was out in many places for as much as a day or
two. Judicious manipulation of multiple deep cycle batteries probably
would have kept my system going after switching off all non-essential
equipment. For example, if I'd had a news machine at the time, I
could have deactivated it and used its separate UPS when the main
system was almost out of power. Walter tells me that you can also
charge batteries during long outages by just sticking them in a car,
or you can start your car by "borrowing" one of the UPS batteries!
During my current business trip, when I've had to be away from the system
for almost two weeks, I asked my roommate's girlfriend to check the
system and reset it when "nothing appears when you hit the
When part of your network link fails, you're dead in the water.
If you're using a SLIP connection for your network link, you will find
yourself disconnected occasionally. I have written a program to automatically
reconnect myself when this happens. It runs every 15 minutes (through
cron) and checks to see if there is a DIP process running. If there is
no DIP running, it starts one up. This lets my system automatically reconnect
even when I'm not there to tend it. Here's the program:
/* dipcheck.c -- check to see if DIP is running
   By David H Dennis * david@amazing.amazing.com
   This program is hereby placed in the public domain; no warranty
   exists, expressed or implied.
*/
#include <stdio.h>
#include <stdlib.h>

#define FN "/tmp/dipps"
#define LEN 100

int main(void)
{
    char s[LEN];
    int ct = 0;
    FILE *fp;

    /* Save every process line that mentions 'dip' */
    system("ps -aux | grep dip >/tmp/dipps");
    fp = fopen(FN, "r");
    if (fp == NULL) {
        /* Without the listing we cannot tell whether DIP is running */
        perror(FN);
        return 1;
    }
    /* Count the matching lines */
    while (fgets(s, LEN, fp)) {
        ct++;
    }
    fclose(fp);

    /* The PS and DIPCHECK commands also contain the word 'dip', so
       if there are fewer than 3 uses of the word found by grep, we're not
       connected, and an attempt should be made to reattach ourselves */
    if (ct < 3) {
        printf("Executing DIP ...\n");
        system("/user/dip/dip /user/dip/sample.dip >/tmp/dipout");
        /* Note: Hollywood is the name of my network connection */
        system("route add hollywood");
        system("route add default gw hollywood");
    }
    return 0;
}
Someone will probably flame me for writing this in C, when it would
have been more elegant as perl or even a shell script, but who has
time to learn them? :-(
According to Tony Sanders
If a 56k or T1 connect fails, Tony Sanders
Rackmount modems really score here; as mentioned previously, there are
complex diagnostics and re-routing systems built into the modems. If,
as seems more likely, you have a bunch of tangled wires leading to
heaps of external modems, you will have to find out which one is causing
the problem and reset it. This can usually be done easily enough by
switching it off and on.
If you are calling the system from a remote site, and find it rings and
rings thanks to a bum modem, you can transfer to the next line by calling
the main number on your voice line, and then calling the same number on
your data line. You should then get the next line on the rotary, which
is (hopefully) active. Then, it's a fairly simple matter to inspect ps,
find the runaway job, and kill it. Usually that will reset the modem,
and the system will once again work. (Again, Walter Vose Jeffries
If you still can't get on, it's recommended that you dial up a backup
account you have on a competing provider and telnet to your system.
I have such an account on both Netcom and Smartdocs (the latter being
a small local provider). This also helps me test customer complaints
about reachability, and problems I may have with my WWW pages and
other services.
As we discussed in the sections on IRC and Lynx, these programs have
some interesting bugs that cause them to "run away", making CPU
usage zoom to no great use. I have devised a Perl script to scan
PS in search of these evil jobs. It consists of two parts: RCHECK
runs RUNCHECK repeatedly.
Because I am still testing and refining the performance of these
programs, I run rcheck from a virtual console, and occasionally
watch its work. Once you're satisfied with it, you can put
rcheck's single pipeline (the first 'system' command) in your crontab
and run it every 15 minutes or so.
Note that this has two separate code segments, (A) and (B). (A)
kills any process that exceeds the CPU time listed that is not
being used by administrative users (remember to put your own name
on the list!). (B) kills any irc or lynx processes that are not
being used by an administrative account. (B) is recommended if
you sell shell accounts. NOTE! You must select one and only one
of (A) or (B) - comment out the other by putting "#"s in front of
its lines.
You may want to run runcheck with the killing parts commented out
to see what tasks it actually kills before using it.
These are my first perl programs, so be gentle with criticism. In
particular, I'm sure rcheck could have been written better without
the system commands.
As always, these programs are freely given to the public domain,
although it would be nice if you kept the credit lines in. Since
I didn't sell these programs for billions of dollars, of course I
accept no responsibility for the consequences of trying 'em out.
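To make rule (A) concrete, here is a minimal sketch in C rather than Perl. It is not the runcheck script itself: it only shows the decision of whether a process should be killed, so it is safe to experiment with. The admin_users list, the function names, and the assumption that ps prints CPU time as MM:SS or HH:MM:SS are all mine, for illustration only; check what your own ps prints before relying on it.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical list of administrative users whose jobs are never killed.
   Put your own login name here. */
static const char *admin_users[] = { "root", "david", NULL };

/* Convert a ps TIME field such as "12:34" (assumed MM:SS) or "1:02:03"
   (assumed HH:MM:SS) into seconds. Returns -1 if the field does not parse. */
long cpu_seconds(const char *field)
{
    int a, b, c;
    char extra;

    /* The trailing %c catches junk after the last number */
    if (sscanf(field, "%d:%d:%d%c", &a, &b, &c, &extra) == 3) {
        if (a < 0 || b < 0 || c < 0)
            return -1;
        return a * 3600L + b * 60L + c;
    }
    if (sscanf(field, "%d:%d%c", &a, &b, &extra) == 2) {
        if (a < 0 || b < 0)
            return -1;
        return a * 60L + b;
    }
    return -1;
}

/* Return 1 if the user is on the administrative list, 0 otherwise. */
int is_admin(const char *user)
{
    int i;

    for (i = 0; admin_users[i] != NULL; i++) {
        if (strcmp(user, admin_users[i]) == 0)
            return 1;
    }
    return 0;
}

/* Rule (A): a process is a kill candidate when it belongs to a
   non-administrative user and has used more CPU than the limit.
   Unparseable TIME fields are never treated as candidates. */
int should_kill(const char *user, const char *time_field, long limit_seconds)
{
    long used = cpu_seconds(time_field);

    if (used < 0 || is_admin(user))
        return 0;
    return used > limit_seconds;
}
```

With a 30-minute limit, should_kill("guest", "45:00", 1800) returns 1, while the same call for "root" returns 0. Printing the verdict instead of acting on it gives you the dry run recommended above.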
rcheck:
#!/usr/bin/perl
# rcheck - run runcheck perl script forever
# By David H Dennis
runcheck:
#!/usr/bin/perl
# Perl program to process output of PS
# By David H Dennis
system("uptime");
system("date");
while (
Even mighty Netcom, with Rieger-knows-how-many gigabytes of disk,
has run out of space on occasion. So it's not just you. (Bob
Rieger is the owner of Netcom).
Don't let that make you feel complacent, though. Few things will make
your users unhappier with you and your system than a full disk.
Bryant Durrell
I might add that you'd better check up on your i-nodes as well as
your overall disk space. Because I didn't, I've lost mail on my
system. Don't let that happen to you!
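A check for both problems can be sketched in C with the POSIX statvfs() call, which reports free blocks and free i-nodes for a mounted filesystem. This is only a sketch under assumptions: the function names and the use of a single percentage threshold for both resources are hypothetical, and you would choose the mount points and threshold to suit your own system.

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Percentage of a resource in use, given total and available counts.
   Returns 0 when the filesystem reports no total (for example, a
   filesystem with no fixed i-node table). */
int percent_used(unsigned long long total, unsigned long long avail)
{
    if (total == 0 || avail > total)
        return 0;
    return (int)(((total - avail) * 100) / total);
}

/* Check one mount point. Returns 1 if either blocks or i-nodes have
   reached the threshold, 0 if both are below it, -1 if statvfs() fails.
   f_bavail and f_favail count what is available to ordinary users, so
   space reserved for root is treated as already used. */
int check_filesystem(const char *path, int threshold_pct)
{
    struct statvfs vfs;
    int blocks_pct, inodes_pct;

    if (statvfs(path, &vfs) != 0) {
        perror(path);
        return -1;
    }
    blocks_pct = percent_used(vfs.f_blocks, vfs.f_bavail);
    inodes_pct = percent_used(vfs.f_files, vfs.f_favail);
    printf("%s: %d%% of blocks used, %d%% of i-nodes used\n",
           path, blocks_pct, inodes_pct);
    return (blocks_pct >= threshold_pct || inodes_pct >= threshold_pct) ? 1 : 0;
}
```

Run from cron against the filesystem that holds your mail spool, with a threshold of around 90, this would warn you before mail starts bouncing; both the schedule and the threshold are guesses you should tune.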
Off the top of my head (another section that needs to be fleshed out
with some real-world opinions), you should back up your system and
user files daily, probably with a seven-day rotating backup
procedure. I wouldn't be worried about news; lost news tends to be a
self-correcting problem.
Recommendations on backup equipment and procedures would be appreciated
here.
What should I do about them?
My thanks to Alicia Salomon
Most providers will start with a single computer performing all functions,
including mail, news, ftp, www serving and user processing.
Because news flows into the system constantly, and since its processing
can put a significant burden on the system's disks, this is normally
one of the first functions to be transferred to a separate machine. Since
the advent of INN, this is not nearly as much of a problem as it once
was, but this is still sound advice.
Tony Sanders
For an impartial view of PageSat, check out Nick's PageSat web page at
http://www.kfu.com/~nsayer/pagesat/. This page was created by a
PageSat user and clearly explains the pros and cons of the package.
My thanks to annette@acm.org (Annette Thompson) for pointing this out.
An update on PageSat was provided by Kevin Kadow
Other processes often put on dedicated machines include FTP, Gopher
and WWW. FTP in particular can put an enormous strain on system
disks, especially if users are allowed to place popular files in their
own directories. There was an enormous stink created on Netcom when
it was discovered that some users' FTP directories had X-rated
pictures in them, and that they constituted some 60% of the total
bytes downloaded from all of Netcom, causing vast overloads on Netcom's
machines. It might be a very good idea to devote a machine with a
large local disk to the user directories and transfer all the load off
the main system. Unfortunately, this doesn't help ease the strain on
your net connection.
Potential load from Gopher and WWW could be immense, particularly
if massive image files are involved. It would probably be a good
idea to use the FTP machine for user Web and Gopher pages as well.
Tony Sanders notes the following: "Well, the real point of load comes
from how popular the information is. The servers of the
Shoemaker-Levy comet photos got creamed as thousands and thousands
of people requested the pictures. The good news is that you can
charge some serious money to local business to put up information on
WWW. That alone could probably pay for a T1 line."
Multi-User Dungeons, or MUDs, are "virtual world" games that account
for a large percentage of the Internet's popularity. Karl Denninger
Note that telnetting TO MUDs is a very easy thing to do; operating
one on your system is the complex and compute-intensive burden Karl's
talking about here.
Karl Denninger
Most of us, of course, would like nothing more than to follow that
example; the only problem is that our checking accounts are
suspiciously bare.
Craig Warner
Most of us don't have that kind of money, either.
The key to having a cut-rate news server is short expiration times.
Craig tells me an SS1 with 32MB RAM and two 2GB drives would do fine, if
you expire your news in about 2-3 days and have a reasonable number of
readers. If you have about 5 or so readers (which would be
appropriate for up to 20 lines, most likely), you could get away with
that system and a 14 day expiration time (although you might need more
disk space than that).
Once you get more readers, you need a more powerful news machine.
One interesting alternative has been proposed: piggyback off someone
else's news machine. This has the advantage that you don't need to go
through the disk-grinding agonies of keeping news. It has the
disadvantage that your precious bandwidth to the Internet is being
used by people reading news. It's also likely to be slower for the
user than having news at your site.
Several companies will provide this service:
alt.net : Contact Chris Caputo (ccaputo@alt.net) for information.
texas.net: Contact barron@texas.net (Jonah Yokubaitis). $50/month,
$0.20/user.
dgs.dgsys.com: Justin Newton
inquo.net: Contact info@inquo.net. News hosting: $100/month + $1/user.
Full (10k+ newsgroups) newsfeed or hosting on a *very* fast server with
a T1 line for high speed news access. A full newsfeed is also offered
for $70/month.
dbtech.net: Contact dbrass@dbtech.net. News hosting $40/month +
$0.20/user, $0.44/user for Clarinet. Carries alt, rec, talk and sci
groups minus "questionable" ones [in Alabama] including alt.binaries
and probably most of the sex stuff. This might be good for people who
are strongly anti-porn or believe the Exon stuff is the wave of the
future. Note, however, that no newsfeed service can guarantee
fully filtered content.
Write or obtain an idle timeout program. Usually the archives for
your operating system will have something that will do. For Linux,
ftp to Sunsite.unc.edu and get /pub/Linux/system/Admin/idleout.tar.Z.
There is a certain degree of controversy over how long the idle timeout
should be. Netcom uses 10 minutes, which many people find too short.
MCS uses 20 minutes, which is probably about right.
I think it would be a good idea to vary the idle timeout depending on
the number of lines in use. During an extremely light load time, it
might be ok to make it as much as an hour. This can help users who
have to go to the bathroom or who got engaged in a long conversation,
and it doesn't much hurt the system. However, I have not yet
experimented with the idle timeout software.
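The idea of varying the timeout with load can be sketched as a small C function. It is untested against any real idle timeout package: the function name is hypothetical, the 20-minute floor is borrowed from the MCS figure above, the 60-minute ceiling from the "as much as an hour" suggestion, and the straight-line scaling between them is simply the easiest rule to reason about.

```c
/* Assumed bounds: 20 minutes when every line is busy,
   60 minutes when the system is empty. */
#define MIN_TIMEOUT 20
#define MAX_TIMEOUT 60

/* Scale the idle timeout linearly with the fraction of free lines.
   Out-of-range inputs are clamped so the result always stays
   between MIN_TIMEOUT and MAX_TIMEOUT. */
int idle_timeout_minutes(int lines_in_use, int total_lines)
{
    int free_lines;

    if (total_lines <= 0)
        return MIN_TIMEOUT;
    if (lines_in_use < 0)
        lines_in_use = 0;
    if (lines_in_use > total_lines)
        lines_in_use = total_lines;

    free_lines = total_lines - lines_in_use;
    return MIN_TIMEOUT + (MAX_TIMEOUT - MIN_TIMEOUT) * free_lines / total_lines;
}
```

On a 20-line system this gives 60 minutes with nobody on, 40 minutes with 10 lines busy, and 20 minutes when all lines are in use.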
Tony Sanders
What about users who seem to be on the system 24 hours a day, 7 days
a week? This leads us to the controversial question of pricing.
Historically, Internet service providers have charged a fixed fee
per month, regardless of the amount of use made of the system. This
has almost always been the best model for customers; all but the
lightest users pay less than they would under the non-fixed schemes.
Unless they have unique offerings like the slick graphical interface
of NYC's Pipeline, new providers are not going to be able to come in
at higher fees than (say) Netcom or CRL. This pretty much eliminates
the idea of hourly fees for most.
Well, maybe. Draper Kauffman (draperk@io.com) notes that Netcom is in
his area at their normal rates. Despite having higher rates, his
system and other local competitors have not suffered; the reason, of
course, is Netcom's infamously terrible service. He suggests that
excellent service can still get $25/month or more.
Hourly fees are mandatory, of course, if you offer continental
US toll-free access. This can often be arranged at very competitive
hourly cost as compared to a toll call to your site.
Some services, particularly bulletin boards, undercut the typical
ISP monthly rate but restrict access to a certain amount of time
each day. The Pipeline offers a set number of "free" hours and
charges for any longer period of time spent online.
Unfortunately for those of us who want to provide unlimited accounts,
the growing popularity of SLIP/PPP makes it very difficult to stick to
our guns. SLIP/PPP accounts are unobtrusively there; the customer's
computer is part of the Internet, and the most natural thing in the
world for many users is to just dial in to the system and forget it;
use their computer normally and access the Internet when they feel
like it. SLIP email programs can be told to check for mail every five
minutes; that's way below what any sensible idle timeout would be, so
the effect is for the SLIP user to be on the system 24 hours a day, 7
days a week.
There are three basic approaches to dealing with this problem:
* Have a very long time limit on the account, usually around 150 hours
a month. This forces people to keep track of the time they spend on
the system.
* Have a policy that says that you monitor excessive usage, defined as
being online and not doing anything actively other than routine mail
checks. People who have used this policy report that most people who
are told that they need a dedicated (circa $100-175/month) account
will get one.
* Charge by the minute from the first second of use on. Most users
hate this idea.
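The first approach is simple to check mechanically once you have each user's connect time for the month. The following C sketch assumes you already total that time in seconds from your accounting logs; the function name is hypothetical, and the 150-hour figure is the one mentioned above.

```c
/* Monthly allowance from the first approach above */
#define MONTHLY_LIMIT_HOURS 150

/* Hours by which a user's connect time exceeds the monthly limit,
   rounded up to the next whole hour. Returns 0 for users who are
   within the limit. */
long hours_over_limit(long seconds_online)
{
    long limit = MONTHLY_LIMIT_HOURS * 3600L;

    if (seconds_online <= limit)
        return 0;
    return (seconds_online - limit + 3599) / 3600;
}
```

A SLIP user who stays connected for all 720 hours of a 30-day month comes out 570 hours over, which shows how far a permanently connected account is from ordinary dial-up use.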
If you're a single individual, how can you hope to deal with system
problems in an expeditious manner? Bryant Durrell writes:
"Since you probably won't have someone monitoring 24 hours a day,
you'll need some sort of notification of urgent problems short of
users calling you in the middle of the night. One solution is a
beeper. If you have a spare modem and a beeper, it's possible to
write a simple syscheck script that beeps you whenever something goes
critically wrong."
This is an issue for providers that presently have employees, so
I will let someone answer this who is in a better financial condition
than me. :-)
11.1 What can be done about System Crashes?
11.2 What can be done about Network Outages?
11.3 Hung Modems
11.4 Killing Runaway Processes
11.5 The Dreaded Disk Space Crunch
11.6 What would be a good backup policy?
11.7 What services are particularly hard on performance?
11.8 What sort of hardware should I use for my news system?
11.9 What can be done about users who walk away from the keyboard?
11.10 What can be done about users who never log out?
11.11 What about people who stay on their SLIP account forever?
11.12 Monitoring Your System
11.13 Trouble Ticket Systems (*)