Smart Cache Manual
Chapter 1 Introduction
Some sections are still missing or inaccurate. If you are out of luck, see the
comments in the sample configuration files or consult the sources for more
information.
This manual has been converted from the Smart Cache English homepage to the
debiandoc-sgml format, which allows many output formats to be generated from
one source. After conversion, this manual was extended by Radim Kolar into its
current form and merged with the translated Czech documentation, which is no
longer maintained.
English is not my native language, so if you see any errors, just ignore them
or mail me.
When all things began, the Word already was. The Word dwelt with God, and what
God was, the Word was. The Word, then, was with God at the beginning, and
through him all things came to be; no single thing was created without him.
All that came to be was alive with his life, and that life was the light of
men. The light shines on in the dark, and the darkness has never mastered it.
New Testament, The Gospel according to John, The coming of Christ.
After leaving my job, I started to use a modem connection to the Internet. It
was slow, but the biggest problem for me was the quite high prices paid to the
monopoly Czech telecommunication company SPT Telecom (now renamed to Czech
Telecom, because many people did not know what SPT meant). I found that I
needed a useful tool which would allow me to browse WWW pages off line.
I tried several methods (see Other off
line browsing solutions, Section 1.4) to achieve this goal, but all of them
have some limitations and I found them unusable for my purposes. These programs
are not bad, they are just not optimal for what I wanted.
- IBM Internet Connection Server 4.0
-
This is a WWW server with a built-in proxy cache. The proxy cache uses a simple
CERN-like directory structure, so it was easy to find cached files. The proxy
cache also has a switch for off line mode, in which it returns only cached
pages. The biggest problem with this server was that it is based on the
original CERN http daemon, which was not thread-safe. IBM ported this daemon
to OS/2, but they did not care about this and did not implement any locking
mechanism to protect thread-sensitive data or provide thread synchronization.
The server complained very often about locked .cacheinfo files, and downloaded
objects were not stored on disk. After some time IBM made a new version, 4.1.
This version introduced HTTP/1.1 support into the WWW server and proxy cache.
The WWW server works with some occasional crashes, but the proxy cache was
totally broken. I never managed to get it running; they probably did not test
this part of their product. After some time IBM abandoned this server and
recommended that ICS users upgrade to Lotus Domino.
- Mailing pages to myself in Netscape
-
I found that in Netscape Navigator it is possible to email an entire web page.
So I started mailing interesting web pages to myself and browsing them via the
Sent Mail folder. This worked quite well; off line browsing was possible (even
with embedded pictures). But Netscape does not save pictures into the Sent
Mail folder, it saves them only into its internal disk cache, so after the
pictures expired, I was unable to see them.
- Using Netscape's internal disk cache
-
The Netscape browser has a persistent disk cache. This disk cache is able to
cache web objects between sessions, and there are a couple of programs, called
Netscape disk cache explorers, which allow the user to browse off line
via Netscape's cache. But this also has several limitations:
-
Netscape does not cache web pages without a 'Last-Modified' HTTP
header. In fact it does: the pages are stored on disk, but they are never read
back, and Netscape deletes them on exit, so they are lost. This is the biggest
problem, because nowadays many web pages are generated on the fly by the WWW
server, so you end up with only the images in the cache.
-
The cache is very slow once it grows to 30-40 MB in size. This is not a problem
in the UNIX version of Netscape, but the OS/2 and Windows versions have this
problem.
-
All information about the cache is stored in one file, index.db.
When this file gets corrupted (not so uncommon), you lose everything.
-
Garbage collection is very strange. I set the disk cache to 50 MB; when it
grew over 50 MB, Netscape deleted nearly all files and left only 10 MB in the
cache. Too bad.
- Using Microsoft Internet Explorer's disk cache
-
I could not believe what I saw. This was much worse than Netscape. MSIE 4 is
stupid and caches even badly downloaded (too short) files. Such a badly
downloaded file displays as good, and if you request a reload of that bad file,
it does not get reloaded, only checked via an If-Modified-Since request.
If you want to remove the bad file from the cache, you must clear the entire
cache. No MSIE, thank you.
- Using web grabbers
-
This looks very promising, but there are some problems:
-
a web grabber downloads what it wants, so it normally downloads many useless
pages and not the pages which you want to see.
-
a web grabber has very few configuration options. Even a very good program
such as wget is stupid in this respect. (This does not apply to my newly
developed web downloader with the working name loader.)
-
the biggest problem is with refreshing web pages. You have only three choices:
refresh everything (which normally downloads the entire set of pages again),
never refresh, or refresh pages manually via the WWW browser and Save as...
- Using Lotus Notes/Domino
-
Lotus Notes can work with HTML documents the same way as it does with its
normal Document database. You may use any Notes features, such as Agents or
Scripts, on WWW documents. This is very good for writing Internet or Intranet
applications, but not the best solution for normal browsing. The built-in WWW
browser is very limited, even when compared to the old Netscape 2. It downloads
only one WWW object at a time, so a web page with many pictures takes very long
to load. Notes also requires too many system resources; you cannot run it on a
486 computer with only 20 MB RAM.
- WWW Offline Explorer (wwwoffle)
-
This program does basically the same thing for offline browsing as Smart Cache.
It is written in C and is available only for Unixes. I have run a web benchmark
on both (SC and wwwoffle); when using a small cache (about 10 MB) the results
are similar (wwwoffle is about 8% faster; in the same benchmark as used in
Smart Cache Performance, Section 7.2, it does 984 pages/min on Linux), but with
a large cache SC is much faster, because wwwoffle uses just one root directory
level (SC uses 2) and no per-host directory level, so you end up with very
large directories, which are very slow to search (at least on my machine).
Also, WWWOffle's history is recorded as symlinks in a special directory, one
symlink for each visited URL. WWWOffle does not support old HTTP/0.9 clients.
Files stored by wwwoffle have the HTTP header inside and use long hashed
filenames, but the cache contents can be browsed through the HTML interface.
wwwoffle has a nice built-in HTML interface and can be easily controlled from
a browser; it also allows marking pages for later batch downloading (a good
thing) or updating. Batch downloading does not work well, though: when I tried
it, it very often ended in infinite loops on already downloaded URLs.
Summary: If you have a very large network or if you want Squid, get Squid. If
you don't like SC, get wwwoffle; it will also do a good job, and the HTML
interface is a nice thing. If you don't like Squid, SC or wwwoffle, I have no
idea what to use.
After considering the Other off line browsing
solutions, Section 1.4, I decided to write my own program which would solve
all of these problems. I wrote down the following design notes:
-
Perfect off line browsing support. The user must not see any difference between
on line and off line browsing.
-
Implement it as a proxy cache. It will be independent of the browser used and
fully transparent to the user.
-
Use a CERN-httpd-like (not hashed like Apache, Squid or WWWOffle) directory
structure for the proxy cache, for easy locating of cached objects. Try to use
real file names like index.html and not obscure hashes like
Q3E4R2T342XCV3F42G3H2323. (A sketch of such a layout follows this list.)
-
For performance reasons, implement 2 swap directory levels (an idea from Squid).
-
For easy file access, do not store HTTP headers inside cached objects. When I
tried to extract binary files (pictures) from CERN, Squid or Apache caches
with a text editor that claims to support binary files (ViM), because I was
too lazy to write a special program, it still failed and the files got
corrupted.
-
Do not store all received HTTP headers, just the important ones.
-
The program must be fully portable. I want to use it on OS/2, Linux and Windows.
-
The cache must be able to cache everything that other caches don't. I don't
want to write a well-behaved cache which respects the headers that webmasters
use to gain more hits. Modem lines are slow. In fact, after writing Smart Cache
I was surprised how much faster browsing can be if we cache somewhat more than
usual and kill some advertising banners. This really makes a difference!
-
The program must allow blocking of unwanted URLs. Yes, for killing advertising banners.
-
The program must remain fast and simple.
-
Extremely configurable and tunable garbage collection. I can't accept the
all-or-nothing design used in other caches. I want to control what stays
on my disk, and for how long.
-
The possibility to continue downloading an object even if the user presses
STOP in the browser (an idea from Squid).
-
The program must be robust, and the possibility of data loss must be minimized.
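To make the directory layout notes above more concrete, here is a minimal
sketch, in Java, of how a URL could be mapped to a readable on-disk path with
two swap directory levels. It is an illustration only, not the actual Smart
Cache code: the number of swap directories, the hash function and the class
and method names are all assumptions.

     import java.net.URL;

     public class CachePathSketch {
         // Map a URL to a cache file path: two swap directory levels
         // derived from a hash of the host name, followed by the host
         // name and the readable URL path itself (CERN-like layout).
         static String pathFor(URL u) {
             int h = u.getHost().hashCode() & 0x7fffffff; // non-negative hash
             String file = u.getPath();
             if (file.isEmpty() || file.endsWith("/"))
                 file += "index.html";                    // readable default name
             return (h % 16) + "/" + (h / 16 % 16) + "/" + u.getHost() + file;
         }

         public static void main(String[] args) throws Exception {
             // Prints something like: 9/4/www.example.com/docs/index.html
             System.out.println(pathFor(new URL("http://www.example.com/docs/")));
         }
     }

With such a layout, a cached object stays a byte-for-byte copy of the original
file, so it can be opened directly with any program, and the few important HTTP
headers can be kept in a small metadata file next to the objects instead of
being prepended to them.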
Smart Cache Manual
0.49.1
Radim Kolar hsn@cybermail.net