HTTrack
The Web Mirror Utility

(Short) Documentation



I- Quick start (Windows release)

Just follow the steps:

  1. Launch WinHTTrack and choose an option: Mirror sites, Mirror with wizard (i.e. semi-automatic mode), or Get separated files.

  2. Enter URLs (i.e. Internet addresses, such as www.test.fr/~bob/) in the URL list.
    Optionally, click the Filters.. button to define filters for links.


  3. Optionally, specify a limited link depth (if you do not, the entire site will be mirrored; e.g. www.test.abs/~mike/ will mirror all of Mike's site). You can also specify a proxy (ask your administrator). Do not forget to set the paths for the mirror files (the files retrieved) and the log files (files recording errors and actions performed).

  4. Click the NEXT-> button. You can then start the mirror by clicking START, or define further options first.


Tip: You can enter more than one URL, by pressing Control-Enter after each line. This will mirror several sites together.


Options: Many options can be defined (maximum file size, site size, building options, timeout, etc.).

Proxy: Set the proxy field if you want to use a proxy (ask your Internet provider if you do not know the proxy name or port).

Filters: Click this button to define filters. Under Windows, you can use the "Exclude links" and "Accept links" buttons.




Note: Filters are analyzed in the order in which you defined them. E.g. if you accept all files from a domain and then forbid all gif files, gif files from that domain will be forbidden. If, after these two filters, you define a third filter accepting all files named 'mydraw.gif', gif files from that domain will be forbidden except 'mydraw.gif' files. Remember that the order in which you define filters is important. In addition, the filters you define override several options, such as the travel options.

More details about filters are given below, in case you want precise control over what filters can do (if not, skip this section):

Once you have defined the start links, the default mode mirrors these links and everything below them, i.e. if one of your start pages is www.myweb.com/test/index.html, all links starting with www.myweb.com/test/ will be accepted. Links directly in www.myweb.com/.. will not be accepted, however, because they are in a higher structure. This prevents HTTrack from mirroring the whole site. (All files in structure levels equal to or lower than the primary links will be retrieved.)
You can refuse some files with filters, as we will see below.
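
The default scope rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in this documentation, not HTTrack's actual code: the function names are hypothetical, and the sketch assumes scheme-less URLs written as in the examples here (no "http://" prefix).

```python
def default_scope_prefix(start_url: str) -> str:
    """Return the directory part of a start URL, e.g. 'www.myweb.com/test/'.

    Hypothetical helper: assumes a scheme-less URL such as
    'www.myweb.com/test/index.html'.
    """
    # Keep everything up to and including the last '/'.
    head, sep, _ = start_url.rpartition("/")
    # A bare host name ('www.myweb.com') has no '/', so the whole host is the scope.
    return head + sep if sep else start_url + "/"


def in_default_scope(start_url: str, link: str) -> bool:
    """True if 'link' is at the same level as, or below, the start URL."""
    return link.startswith(default_scope_prefix(start_url))


start = "www.myweb.com/test/index.html"
print(in_default_scope(start, "www.myweb.com/test/sub/page.html"))  # below the start link
print(in_default_scope(start, "www.myweb.com/other.html"))          # higher structure
```

Links for which the second call returns False are the ones you would need an explicit '+' filter for.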

Filters are analyzed by HTTrack from the first filter to the last one. The complete URL name is compared to the filters defined by the user or added automatically by HTTrack.
For every filter, if the link is recognized and a '+' was typed before the filter, the link is considered "accepted". If a '-' was typed and the link is recognized, the link is considered "forbidden". Every new status overrides the previous one, so the order is important. If no status could be determined, HTTrack decides by itself what to do by analyzing the link (upper/lower structure, and so on).
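
To make the "last matching filter wins" rule concrete, here is a minimal Python sketch. It is a simplified model and not HTTrack's matcher: it uses Python's fnmatch module, which only approximates the plain * joker, and the host name www.first.com and the function name filter_status are assumptions made for the example.

```python
from fnmatch import fnmatchcase
from typing import List, Optional


def filter_status(url: str, filters: List[str]) -> Optional[bool]:
    """Return True (accepted), False (forbidden) or None (no filter matched).

    Simplified model: every filter must start with '+' or '-', and filters
    are scanned from first to last so that a later match overrides an
    earlier one.
    """
    status = None
    for rule in filters:
        sign, pattern = rule[:1], rule[1:]
        if sign not in ("+", "-"):
            raise ValueError("filter must start with '+' or '-': %r" % rule)
        if fnmatchcase(url, pattern):
            status = (sign == "+")  # the new status replaces the previous one
    return status


# Accept a domain, then forbid gif files, then re-accept 'mydraw.gif' files.
filters = ["+www.first.com/*", "-*.gif", "+*/mydraw.gif"]
for url in ("www.first.com/index.html",
            "www.first.com/img/logo.gif",
            "www.first.com/img/mydraw.gif",
            "www.other.com/page.html"):
    print(url, filter_status(url, filters))
```

A result of None corresponds to the case where no status could be determined and HTTrack falls back on its own analysis of the link.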


Here are some example filters (they can also be generated automatically using the interface):

www.thisweb.com*        This will refuse/accept this web site (all links located in it will be rejected/accepted)
*.com/*                 This will refuse/accept all links that contain .com
*cgi-bin*               This will refuse/accept all links that contain cgi-bin
www.*.com/*[path].zip   This will refuse/accept all zip files in .com addresses
*myweb*/*.tar*          This will refuse/accept all tar (or tar.gz etc.) files on hosts containing myweb
*/*mypage*              This will refuse/accept all links containing mypage in the path (but not in the host address)
*.html                  This will refuse/accept all html files.
                        Warning! With this filter you will accept ALL html files, even those at other addresses (causing a global (!) web mirror). Use www.myweb.com/*.html to accept all html files from one web site.
*.html*[]               Identical to *.html, but the link must not have any supplemental characters at the end (links with parameters, like www.myweb.com/index.html?page=10, will be refused)


As you have seen, special jokers (*[..]) can be used to match specific characters:

*                       any characters (the most commonly used)
*[file] or *[name]      any filename or name, i.e. any characters except / ? and ;
*[path]                 any path (and filename), i.e. any characters except ? and ;
*[a,z,e,r,t,y]          any letters among a,z,e,r,t,y
*[a-z]                  any letters
*[0-9,a,z,e,r,t,y]      any characters among 0..9 and a,z,e,r,t,y
*[]                     no characters must be present after
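
One way to reason about the jokers listed above is to translate them into regular expressions. The Python sketch below does this under stated assumptions: each *[..] joker is taken to match zero or more characters from its set, the whole link must match, and the *[] joker is deliberately left out because its effect depends on how HTTrack treats trailing characters. The function names are hypothetical and this is not how HTTrack itself is implemented.

```python
import re


def _class_for(body: str) -> str:
    """Translate the text between *[ and ] into a regex fragment."""
    if body in ("file", "name"):
        return r"[^/?;]*"          # any characters except / ? and ;
    if body == "path":
        return r"[^?;]*"           # any characters except ? and ;
    if body == "":
        raise ValueError("*[] is not modelled in this sketch")
    items = []
    for item in body.split(","):
        if len(item) == 1:                       # a single character, e.g. 'a'
            items.append(re.escape(item))
        elif len(item) == 3 and item[1] == "-":  # a range, e.g. 'a-z' or '0-9'
            items.append(re.escape(item[0]) + "-" + re.escape(item[2]))
        else:
            raise ValueError("unsupported joker item: %r" % item)
    return "[" + "".join(items) + "]*"


def joker_to_regex(pattern: str) -> "re.Pattern":
    """Compile a filter pattern (without its leading + or -) into a regex."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("*[", i):
            end = pattern.find("]", i + 2)
            if end == -1:
                raise ValueError("unterminated *[..] joker")
            out.append(_class_for(pattern[i + 2:end]))
            i = end + 1
        elif pattern[i] == "*":
            out.append(".*")                     # plain joker: any characters
            i += 1
        else:
            out.append(re.escape(pattern[i]))    # literal character
            i += 1
    return re.compile("".join(out))


rx = joker_to_regex("www.*.com/*[path].zip")
print(bool(rx.fullmatch("www.site.com/files/archive.zip")))
print(bool(rx.fullmatch("www.site.com/get?file=archive.zip")))
```

The second link is not matched because *[path] excludes the ? character, which is the difference between *[path] and the plain * joker.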





Tip: To use WinHTTrack as a spider (for checking links), set the scan mode to "Just scan", check the "Log files" and "Test all links" boxes, and uncheck the "Cache" box.
Combine the options to obtain different results.

 

Ib- FAQ (WinHTTrack and HTTrack)


Tip: If you run into problems during a transfer, have a look at the hts-err.txt (and hts-log.txt) files to see what happened. These log files report all events that may help you detect a problem.

Troubleshooting:
When I use HTTrack, nothing is mirrored (no files). What's happening?
Some pages can't be seen, or are displayed with errors!
HTTrack stays idle for a long time without transferring. What's happening?
I am behind a firewall. What can I do?
HTTrack has crashed during a mirror. What's happening?

Questions concerning a mirror:
I want to mirror a Web site, but there are some files outside the domain, too. How do I retrieve them?
I have forgotten some URLs of files during a long mirror. Should I redo everything?
I just want to retrieve all ZIP files or other files in a web site/in a page. How do I do it?
There are ZIP files in a page, but I don't want to transfer them. How do I do that?
I don't want to load gif files, but what happens when I view the page?
I get all types of files on a web site, but I didn't select them in the filters!
When I use filters, I get too many files!
When I use filters, I can't access another domain, but I have filtered it!
Must I add a '+' or '-' in the filter list when I want to use filters?
I want to find file(s) in a web site. How do I do that?


Troubleshooting:

Q: When I use HTTrack, nothing is mirrored (no files). What's happening?
A: First, make sure that the URL you typed is correct. Then check whether you need to use a proxy server (see the proxy options in WinHTTrack or the -P proxy:port option in the command line program). The site you want to mirror may only accept certain browsers. You can change your "browser identity" with the Browser ID option in the OPTION box. Finally, you can have a look at the hts-err.txt (and hts-log.txt) files to see what happened.

Q: Some pages can't be seen, or are displayed with errors!
A: Some pages may include JavaScript or Java files that are not recognized, for example because of generated filenames. There may be transfer problems, too (broken pipe, etc.). But most mirrors do work. We are still working to improve the mirror quality of HTTrack.

Q: HTTrack stays idle for a long time without transferring. What's happening?
A: You may be trying to reach some very slow sites. Try a lower TimeOut value (see the options, or the -Txx option in the command line program). Note that if a timeout happens, the entire site will be abandoned (unless that option is unchecked). With the Shell version, you can also skip some slow files.

Q: I am behind a firewall. What can I do?
A: You need to use a proxy. Ask your administrator for the proxy server's name and port. Then use the proxy field in HTTrack, or the -P proxy:port option in the command line program.
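
If you launch the command line program from a script, the two options mentioned in this FAQ (-P proxy:port and -Txx) can be assembled as in the Python sketch below. The helper name build_httrack_args and the proxy address proxy.example.com:8080 are assumptions made for the example, and the sketch only builds the argument list without running anything.

```python
from typing import List, Optional


def build_httrack_args(url: str,
                       proxy: Optional[str] = None,
                       timeout: Optional[int] = None) -> List[str]:
    """Build an argument list for the command line program (hypothetical helper)."""
    args = ["httrack", url]
    if proxy is not None:
        # -P proxy:port, as described in the answer above
        args += ["-P", proxy]
    if timeout is not None:
        # -Txx, where xx is the TimeOut value
        args.append("-T%d" % timeout)
    return args


args = build_httrack_args("www.myweb.abc",
                          proxy="proxy.example.com:8080",  # placeholder proxy
                          timeout=30)
print(" ".join(args))
```

The resulting list can be passed to subprocess.run(args) on a machine where httrack is installed.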

Q: HTTrack has crashed during a mirror. What's happening?
A: We try to avoid bugs and problems so that the program is as reliable as possible, but we are not infallible. If you encounter a bug, please check that you have the latest release of HTTrack, and send us an email with a detailed description of your problem (OS type, addresses concerned, crash description, and anything else you think is relevant). This may help other users too.


Retrieve options:

Q: I want to mirror a Web site, but there are some files outside the domain, too. How do I retrieve them?
A: If you just want to retrieve files that can be reached through links, activate the 'get file near links' option. If you want to retrieve html pages too, you can use either wildcards or explicit addresses; e.g. add www.myweb.com/* to accept all files and pages from www.myweb.com.

Q: I have forgotten some URLs of files during a long mirror. Should I redo everything?
A: No. If you have kept the 'cache' files (in hts-cache), cached files will not be transferred again.

Q: I just want to retrieve all ZIP files or other files in a web site/in a page. How do I do it?
A: There are several methods. You can use the 'get files near a link' option if the files are in an outside domain. You can also use a filter address: adding -* +*.zip to the URL list (or to the filter list) will accept all ZIP files, even if these files are outside the address. Example: www.myweb.com/myaddress.html -* +*.zip will allow you to retrieve all zip files on the site.

Q: There are ZIP files in a page, but I don't want to transfer them. How do I do that?
A: Just filter them: add -*.zip in the filter list.

Q: I don't want to load gif files, but what happens when I view the page?
A: If you have filtered gif files (-*.gif), links to gif files will be rebuilt so that your browser can find them on the server.

Q: I get all types of files on a web site, but I didn't select them in the filters!
A: By default, HTTrack retrieves all types of files on authorized links. To avoid that, define filters like
-* +<website>/*.html +<website>/*.htm +<website>/ +*.<type wanted>

Q: When I use filters, I get too many files!
A: Your filters are too broad; for example, *.html will get ALL html files identified. If you want all html files at one address, use www.<address>/*.html. There are lots of possibilities using filters.

Q: When I use filters, I can't access another domain, but I have filtered it!
A: You may have made a mistake when declaring filters. For example, +www.myweb.com/* -*myweb* will not work, because -*myweb* has a higher priority (it was declared after +www.myweb.com/*).

Q: Must I add a '+' or '-' in the filter list when I want to use filters?
A: YES. '+' is for accepting links and '-' for refusing them. If you forget it, HTTrack will assume that you want to accept the filter if there is a joker in the syntax, e.g. <filter> is identical to +<filter> if <filter> contains a joker (*) (otherwise it will be treated as a normal link to mirror).


Q: I want to find file(s) in a web site. How do I do that?
A: You can use filters: forbid all files (add -* to the filter list) and accept only html files and the file(s) you want to retrieve. BUT do not forget to add +<website>*.html to the filter list, or pages will not be scanned! Add the names of the files you want with a */ before them, e.g. if you want to retrieve file.zip, add */file.zip.


II- How to use HTTrack (the command-line version)

The command-line program is available for many systems (PC, Linux PC, Sun Solaris, AIX) and allows you to control the robot from a command line. This can be useful for mirroring a web site automatically.

IIb- Example: Use of HTTrack (the command-line version)


You are a webmaster, and you would like to make a mirror of a web site:
Every week (or every day), you can launch (e.g. from crontab):

httrack --update www.myweb.abc -O /public_html/,/home/root/

This will maintain an up-to-date copy of the web site on your host.


You are a simple user, and you would like to make a mirror of a web site for your own use:
Just type:

httrack www.myweb.abc


When you want to update it, just launch: httrack --update and httrack will automatically update it.


You want to check the links in a site/web page:
Just type:

httrack www.myweb.abc --spider

Then look at the file hts-err.txt: all errors will be reported there.

 


©1998 Xavier Roche & Yann Philippot
Comments, questions, problems and bug reports are welcome, for the
shell and for the robot.