Smart Cache Manual
Chapter 1 Introduction
Some sections are still missing or inaccurate. If you are out of luck, see the
comments in the sample configuration files or consult the sources for more
information.
This manual has been converted from the Smart Cache English homepage to the
debiandoc-sgml format, which allows many output formats to be generated from
one source. After conversion, this manual was extended by Radim Kolar into its
current form and merged with the translated Czech documentation, which is no
longer maintained.
English is not my native language, so if you see any errors, just ignore them
or mail me.
When all things began, the Word already was. The Word dwelt with God, and what
God was, the Word was. The Word, then, was with God at the beginning, and
through him all things came to be; no single thing was created without him.
All that came to be was alive with his life, and that life was the light of
men. The light shines on in the dark, and the darkness has never mastered it.
New Testament, The Gospel according to John, The coming of Christ.
After leaving my job, I started to use a modem connection to the Internet. It
was slow, but the biggest problem for me was the quite high prices paid to the
monopoly Czech telecommunication company SPT Telecom (now renamed to Czech
Telecom, because many people did not know what SPT meant). I found that I
needed a useful tool which would allow me to browse WWW pages off line.
I tried several methods (see Other off
line browsing solutions, Section 1.4) to achieve this goal, but all of them
have some limitations and I found them unusable for my purposes. These programs
are not bad, they are just not optimal for what I wanted.
- IBM Internet Connection Server 4.0
-
This is a WWW server with a built-in proxy cache. The proxy cache uses a simple
CERN-like directory structure, so it was easy to find cached files. The proxy
cache also has a switch for off line mode, in which it returns only cached
pages. The biggest problem with this server was that it is based on the
original CERN http daemon, which was not thread-safe. IBM ported this daemon
to OS/2, but they did not care about this and did not implement any locking
mechanism to protect thread-sensitive data or provide thread synchronization.
The server complained very often about locked .cacheinfo files, and downloaded
objects were not stored on disk. After some time IBM made a new version, 4.1.
This version introduced HTTP/1.1 support into the WWW server and proxy cache.
The WWW server works with some occasional crashes, but the proxy cache was
totally broken. I never managed to get it running; they probably did not test
this part of their product. After some time IBM abandoned this server and
recommended that ICS users upgrade to Lotus Domino.
- Mailing pages to myself in Netscape
-
I found that in Netscape Navigator it is possible to email an entire web page.
So I started mailing interesting web pages to myself and browsing them via the
Sent Mail folder. This worked quite well; off line browsing was possible (even
with embedded pictures). But Netscape does not save pictures into the Sent
Mail folder, it saves them only into its internal disk cache, so after the
pictures expired, I was unable to see them.
- Using Netscape's internal disk cache
-
The Netscape browser has a persistent disk cache. This disk cache is able to
cache web objects between sessions, and there are a couple of programs, called
Netscape disk cache explorers, which allow the user to browse off line
via Netscape's cache. But this also has several limitations:
-
Netscape does not cache web pages without a 'Last-Modified' HTTP
header. In fact it does: the pages are stored on disk, but they are never read
back, and Netscape deletes them on exit, so they are lost. This is the biggest
problem, because nowadays many web pages are generated on the fly by the WWW
server, so you end up with only the images in the cache.
-
The cache is very slow once it grows to 30-40 MB in size. This is not a problem
in the UNIX version of Netscape, but the OS/2 and Windows versions have this
problem.
-
All information about the cache is stored in one file, index.db.
When this file gets corrupted (not so uncommon), you lose everything.
-
Garbage collection is very strange. I set the disk cache to 50 MB; when it
grew over 50 MB, Netscape deleted nearly all files and left only 10 MB in the
cache. Too bad.
- Using Microsoft Internet Explorer's disk cache
-
I could not believe what I saw. This was much worse than Netscape. MSIE 4 is
stupid and caches even badly downloaded (too short) files. Such a badly
downloaded file displays as good, and if you request a reload of that bad file,
it does not get reloaded, only checked via an If-Modified-Since request.
If you want to remove the bad file from the cache, you must clear the entire
cache. No MSIE, thank you.
- Using web grabbers
-
This looks very promising, but there are some problems:
-
a web grabber downloads what it wants, so it normally downloads many useless
pages and not the pages which you want to see.
-
a web grabber has very few configuration options. Even a very good program
such as wget is stupid in this respect. (This does not apply to my newly
developed web downloader with the working name loader.)
-
the biggest problem is with refreshing web pages. You have only three choices:
refresh everything (which normally downloads the entire set of pages again),
never refresh, or refresh pages manually via the WWW browser and Save as...
- Using Lotus Notes/Domino
-
Lotus Notes can work with HTML documents the same way as it does with its
normal Document database. You may use any Notes features, such as Agents or
Scripts, on WWW documents. This is very good for writing Internet or Intranet
applications, but not the best solution for normal browsing. The built-in WWW
browser is very limited, even when compared to the old Netscape 2. It downloads
only one WWW object at a time, so a web page with many pictures takes very long
to load. Notes also requires too many system resources; you cannot run it on a
486 computer with only 20 MB RAM.
- WWW Offline Explorer (wwwoffle)
-
This program does basically the same thing for offline browsing as Smart Cache.
It is written in C and is available only for Unixes. I have run a web benchmark
on both (SC and wwwoffle); when using a small cache (about 10 MB) the results
are similar (wwwoffle is about 8% faster; in the same benchmark as used in
Smart Cache Performance, Section 7.2, it does 984 pages/min on Linux), but with
a large cache SC is much faster, because wwwoffle uses just one root directory
level (SC uses 2) and no per-host directory level, so you end up with very
large directories, which are very slow to search (at least on my machine).
Also, WWWOffle's history is recorded as symlinks in a special directory, one
symlink for each visited URL. WWWOffle does not support old HTTP/0.9 clients.
Files stored by wwwoffle have the HTTP header inside and use long hashed
filenames, but the cache contents can be browsed through the HTML interface.
wwwoffle has a nice built-in HTML interface and can be easily controlled from
a browser; it also allows marking pages for later batch downloading (a good
thing) or updating. Batch downloading does not work well, though: when I tried
it, it very often ended in infinite loops on already downloaded URLs.
Summary: If you have a very large network or if you want Squid, get Squid. If
you don't like SC, get wwwoffle; it will also do a good job, and the HTML
interface is a nice thing. If you don't like Squid, SC or wwwoffle, I have no
idea what to use.
After considering the Other off line browsing
solutions, Section 1.4, I decided to write my own program which would solve
all of these problems. I wrote down the following design notes:
-
Perfect off line browsing support. The user must not see any difference between
on line and off line browsing.
-
Implement it as a proxy cache. It will be independent of the browser used and
fully transparent to the user.
-
Use a CERN-httpd-like (not hashed like Apache, Squid or WWWOffle) directory
structure for the proxy cache, for easy locating of cached objects. Try to use
real file names like index.html and not obscure hashes like
Q3E4R2T342XCV3F42G3H2323. (A sketch of such a layout follows this list.)
-
For performance reasons, implement 2 swap directory levels (an idea from Squid).
-
For easy file access, do not store HTTP headers inside cached objects. When I
tried to extract binary files (pictures) from CERN, Squid or Apache caches
with a text editor that claims to support binary files (ViM), because I was
too lazy to write a special program, it still failed and the files got
corrupted.
-
Do not store all received HTTP headers, just the important ones.
-
The program must be fully portable. I want to use it on OS/2, Linux and Windows.
-
The cache must be able to cache everything that other caches don't. I don't
want to write a well-behaved cache which respects the headers that webmasters
use to gain more hits. Modem lines are slow. In fact, after writing Smart Cache
I was surprised how much faster browsing can be if we cache somewhat more than
usual and kill some advertising banners. This really makes a difference!
-
The program must allow blocking of unwanted URLs. Yes, for killing advertising banners.
-
The program must remain fast and simple.
-
Extremely configurable and tunable garbage collection. I can't accept the
all-or-nothing design used in other caches. I want to control what stays
on my disk, and for how long.
-
The possibility to continue downloading an object even if the user presses
STOP in the browser (an idea from Squid).
-
The program must be robust, and the possibility of data loss must be minimized.
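To make the directory layout notes above more concrete, here is a minimal
sketch, in Java, of how a URL could be mapped to a readable on-disk path with
two swap directory levels. It is an illustration only, not the actual Smart
Cache code: the number of swap directories, the hash function and the class
and method names are all assumptions.

     import java.net.URL;

     public class CachePathSketch {
         // Map a URL to a cache file path: two swap directory levels
         // derived from a hash of the host name, followed by the host
         // name and the readable URL path itself (CERN-like layout).
         static String pathFor(URL u) {
             int h = u.getHost().hashCode() & 0x7fffffff; // non-negative hash
             String file = u.getPath();
             if (file.isEmpty() || file.endsWith("/"))
                 file += "index.html";                    // readable default name
             return (h % 16) + "/" + (h / 16 % 16) + "/" + u.getHost() + file;
         }

         public static void main(String[] args) throws Exception {
             // Prints something like: 9/4/www.example.com/docs/index.html
             System.out.println(pathFor(new URL("http://www.example.com/docs/")));
         }
     }

With such a layout, a cached object stays a byte-for-byte copy of the original
file, so it can be opened directly with any program, and the few important HTTP
headers can be kept in a small metadata file next to the objects instead of
being prepended to them.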
Smart Cache Manual
0.49.1
Radim Kolar hsn@cybermail.net