PhpDig version 1.4 Documentation

Last update : 2003-02-21
Note : The author needs help in spelling correction in this version. If you want to contribute to it, mail the author to obtain a XML version of this document.
Previous documentations corrections by Brien Louque.

Where find the lastest PhpDig version ?

At address : http://phpdig.toiletoine.net

PhpDig Features

HTTP Spidering

PhpDig follows links as it was any web browser within a web server, to build the pages list to index.

Links can be in AreaMap, or frames. PhpDig supports relocations. Any syntax of HREF attribute is followed by Phpdig. Simple javascripts links like window.open() or window.location() are followed too.

PhpDig don't go out the root site you define for the indexing. Spidering depth is choosen by user.

All html content is listed, both static and dynamic pages. PhpDig searches the Mime-Type of the document, or tests existence of an <HTML> tag at the beginning of it.

Full Text Indexing

PhpDig indexes all words of a document, excepting small words (less than 3 letters) an common words, those are definded in a text file.

Lone numbers are not inded, but those included in words. Underscores make part of a word.

Occurences of a word in a document is saved. Words in the title can have a more important weight in ranking results.

Other features

PhpDig Tries to read a robots.txt file at the server root. It searches meta robots tags too.

PhpDig can spider sites served on another port than the default 80.
Password protected sites can be indexed giving to the robot an username and valid password.
Be Careful ! This feature could permit to an unauthorized user reading protected informations. We recommand to create a specific instance of PhpDig, protected by the same credentials than the restricted site. You have to create a special account for the robot too.

The Last-Modified header value is stored in the database to avoid redundant indexing. Also the <META> revisit-after tag.

If desired, the engine can store textual content of indexed documents. In this case, relevant extracts of found pages are displayed in the results page with highlighted search keys.

Display templates

A simple templates system permits to adapt search and results page to an existing site look. Making a template consists only in insert few xml-like tags in an html page.

Limits

PhpDig can't perform an exact expression search.

Because of the time consuming indexing process, the Apache/php web server which performs the process must not be safe_mode configured.
This limit can be turn :
- Using distant indexing with MySql TCP connexion and FTP connexion ;
- Launching indexing process in a shell command. This can be made by a cron task.

Spidering and indexing is a bit slow. In the other hand, search queries are fast, even in an extended content.

Installation

Prerequisites

PhpDig requires a Web server (Apache is my preference) with Php (module or cgi), and a MySql database server.

The following configurations were tested :
PHP 4.0.3pl1 cgi, Apache 1.3.14, MySql 3.23.28, Windows 2000 ;
PHP 4.0.5rc1 module, Apache 1.3.20, MySql 3.23.28, Windows 2000 ;
PHP 4.0.5 module, Apache 1.3.20, MySql 3.23.39, Linux with 2.4 kernel ;
Php/4.1.1, Apache/1.3.20 (Win32), Windows 2000 ;
Php/4.1.2, Apache/1.3.23 (Unix) mod_ssl/2.8.7, Linux kernel 2.4.3 ;
Php/4.3.0, Apache/2.0.44 (Unix) OpenSSL/0.9.6g, Linux kernel 2.4.18.

Scripts installation

Unzip the archive in a directory and configure Apache to serve it. (it will be named [PHPDIG_DIR] in the following) The engine did not need a dedicated VirtualHost to operate.
If PhpDig is installed on an Unix operating system server, set the file permissions to writable on the following directories, for the suid Apache server is running :

[PHPDIG_DIR]/text_content
[PHPDIG_DIR]/include
[PHPDIG_DIR]/admin/temp

MySql database installation

There are two processes to install the database.

- Php install script :
In your favorite browser, request the page :

[PHPDIG_DIR]/include/install.php

And follow the instructions.
This script uses the form datas to complete the fields of the "[PHPDIG_DIR]/include/_connect.php" script and copying it to "[PHPDIG_DIR]/include/connect.php".

- Manual installation :
You have to create the database (You can choose any other name than "phpdig") :

#mysql mysql
mysql> CREATE DATABASE phpdig;
mysql> quit

#mysql phpdig < [PHPDIG_DIR]/sql/init_db.sql

Verify that all tables are present :

#mysql phpdig
mysql> SHOW TABLES;

The database answer must be :

+------------------+
| Tables_in_phpdig |
+------------------+
| engine           |
| excludes         |
| keywords         |
| sites            |
| spider           |
| tempspider       |
+------------------+
6 rows in set (0.00 sec)

mysql>

After the database was created, copy the "[PHPDIG_DIR]/include/_connect.php" file to "[PHPDIG_DIR]/include/connect.php" and edit the new one.
Replace the values "<host>", "<user>", "<pass>", and "<database>" to your database server URL, the username, the password to connect to it (if required) and the name you give to the phpdig database.
In a local installation, the values "localhost", "root", and "" are sufficient in most cases.

To verify the install is complete, open the main page [PHPDIG_DIR]/index.php with your favorite web browser.
The search form must be visible.

Configuration

After the install was complete, the engine can work without modifications in the configuration file. The configuration step depends on your needs. Don't forget to change the administration login and password if you use a Php compiled in an Apache dynamic or static module.
Notice : Authentification doesn't operate with a CGI php. In this case, uses an .htaccess file in order to protect the [PHPDIG_DIR]/admin directory.

All configuration parameters are in the [PHPDIG_DIR]/include/config.php file. Each of them is followed by a comment explaining it purpose.

In the following, all statements are lines of the config.php file.
The values are default values.

Configuring administrator access

Change the following constants. If you don't want to see a clear password value, use the Apache authentification functions.

define('PHPDIG_ADM_AUTH','1');     //Activates/deactivates the authentification functions
define('PHPDIG_ADM_USER','admin'); //Username
define('PHPDIG_ADM_PASS','admin'); //Password

Configuring robot and engine

Change following variables and constants.

define('SPIDER_MAX_LIMIT',12);          //max recurse levels in sipder
define('SPIDER_DEFAULT_LIMIT',1);       //default value
define('RESPIDER_LIMIT',4);             //recurse limit for update

define('LIMIT_DAYS',7);                 //default days before reindex a page
define('SMALL_WORDS_SIZE',2);           //words to not index
define('MAX_WORDS_SIZE',30);            //max word size

define('TITLE_WEIGHT',3);               //relative title weight
define('CHUNK_SIZE',2048);              //chunk size for regex processing

define('SUMMARY_LENGTH',500);           //length of results summary


define('TEXT_CONTENT_PATH','text_content/'); //Text content files path
define('CONTENT_TEXT',1);                    //Activates/deactivates the storage of text content.

Configuring templates

Change following variables and constants.

$phpdig_language = "en";                        //GUI language

$template = "$relative_script_path/templates/phpdig.html";  //Template file path
define('HIGHLIGHT_BACKGROUND','yellow');         //Highlighting background color
                                                 //Only for classic mode

define('HIGHLIGHT_COLOR','#000000');             //Highlighting text color
                                                 //Only for classic mode

define('LINK_TARGET','_blank');                  //Target for result links
define('WEIGHT_IMGSRC','./tpl_img/weight.gif');  //Baragraph image path
define('WEIGHT_HEIGHT','5');                     //Baragraph height
define('WEIGHT_WIDTH','50');                     //Max baragraph width
define('SEARCH_PAGE','index.php');               //The name of the search page

define('SEARCH_DEFAULT_LIMIT',10);      //results per page
define('SEARCH_DEFAULT_MODE','start');  // default search mode (start|exact|any)

FTP configuration (if necessary)

PhpDig doesn't indexes FTP sites. Why by the hell needs it ftp connection ?

Lot of PhpDig users install it on shared web servers, and on those, Php is always configured with safe_mode activated. On those shared hostings, access to thecrontab isn't allowed too.

Another instance of Php, on a distinct server is the solution. In my case, a linux server installed at my home and plugged on a cāble connexion runs the update process for the demo version of PhpDig.
Your hosting must only permits you to connect to your MySql database thru TCP/IP.

And what about this famous FTP connection ? It sends textual content of indexed documents to the adequate directory in the distant server.
If you deactivate the FTP function (in case of low-bandwidth connections, like by modem), it is not the extract of documents wich is displayed in results page, but only the summary stored in the database.

FTP parameters are the following.

define('FTP_ENABLE',0);              //Activate/deactivate the ftp connection
define('FTP_HOST','<ftp host dir>'); //FTP server name
define('FTP_PORT',21);               //FTP port
define('FTP_PASV',1);                //Use passive mode (PASV), recommended
define('FTP_PATH','<phpdig root dir>'); //Path of the phpdig directory on server, relative to the ftp rootdir
define('FTP_TEXT_PATH','text_content'); //Text content directory (default)
define('FTP_USER','<ftp username>');  //FTP username account
define('FTP_PASS','<ftp password>');  //FTP password account

Update

Database update

The [PHPDIG_DIR]/sql/update_db_to[version].sql contains all required SQL instructions to update your existing install of PhpDig.

Scripts update

Save your configurations files, and just replace the existing scripts by the new ones.

Manual indexing

Index a new host

Open the admin interface with your browser : [PHPDIG_DIR]/admin/index.php. Just fill in the url field, PhpDig reconizes if it is a new host or an existing one. You can also precise a path and/or a file, wich is the starting point of the robot.

Select the maximum search depth in levels and click on the "Dig This !" button.

A new page opens showing the indexing and spidering process. If a double is displayed, it means that PhpDig has detected that the current document, with a new url, is a duplicate of an existing one in the database.
Each "+" sign means that a new link was detected and will be followed at the next spidering level.
For each level, PhpDig displays the number of new links it has found. If no new link is found, PhpDig stops its browsing and displays the list of the documents.

You can also launch an indexing process by the shell prompt :

#php -f [PHPDIG_DIR]/admin/spider.php http://mydomain.com

Update an existing host

From the admin page, you can reach the update interface by choosing a site and clicking on the [update form] button.
A two parts inteface appears. On the left side of the screen is the client-side folder structure of the site. The blue arrow displays the "folder" content, in order to reindex the documents individually. The document's listing of a folder is on the right side of the screen.

On both sides, the red cross deletes all the selected branch or file, including sub-folders in case of deleting a branch, from the engine.
The green check mark reindexes the selected branch or document if they were indexed for more than [$limit_days] days. It also search new links for documents wich are changed.

Index maintenance

3 scripts are used to delete useless data in the PhpDig database. The links are in the admin page.

Clean index deletes index records not linked to any page. Useful if manual deletes are done in the database.

Clean dictionary deletes keywords which are not used by the index. Useful for reducing the size of the dictionary, particularly when a large site contains a great deal of technical words and is deleted from the engine.

Clean common words must be run when new common words are added in the [PHPDIG_DIR]/includes/common_words.txt file. It deletes all reference to those common words.

Auto update

The [PHPDIG_DIR]/admin/spider.php can be launch by a cron task in order to auto update the index. The recommended periodicity is 7 days. The updated documents you want to see immediately in the searches can be updated manually.
Those pages can contain a "revisit-after" metatag with a short delay.

The script as two parameters :

#php -f [PHPDIG_DIR]/admin/spider.php all

Lauches a normal update.

The following syntax :

#php -f [PHPDIG_DIR]/admin/spider.php http://www.mydomain.net/

only indexes or updates the http://www.mydomain.net/ website.
Use this option if you want to have distincts update delays on each site registered in the engine, making one cron task for each of them.

As any shell command, the output can be redirected to a textfile. (If you want some logs.)

#php -f [PHPDIG_DIR]/admin/spider.php all >> /var/log/phpdig.log

Template tags list

Templates are HTML files containing some xml-like tags wich are replaced with the dynamic PhpDig content.
See the provided templates source code as making templates example.

Two CSS classes are used by PhpDig :
.phpdigHighlight : <SPAN/> class for highlighting of search terms.
a.phpdig : <A/> class for phpdig results and navigation links.

All template tags look like : <phpdig:parametre/>.
Excepted the <phpdig:results></phpdig:results> tag, all are stand-alone tags.

Tags outside the results table

phpdig:title_message   Page title

phpdig:form_head       Starting the search form
phpdig:form_title      Form title
phpdig:form_field      Text field of the form
phpdig:form_button     Submit button of the form
phpdig:form_select     Select list to choose the num of results per page
phpdig:form_radio      Radio button to choose the parsing of search keys
phpdig:form_foot       Ending the search form

phpdig:result_message         Num of results message
phpdig:ignore_message         Too short words message
phpdig:ignore_commess         Too common words message

phpdig:nav_bar         Navigation bar to browse results
phpdig:pages_bar       Navigation bar without previous/next links
phpdig:previous_link src='[img src]'   "Previous" icon
phpdig:next_link src='[img src]'       "Next" icon

Results table tags

phpdig:results       Contains results list

phpdig:img_tag       Relevance Baragraph
phpdig:weight        Relevance of the page (in percents)
phpdig:page_link     Result title and link to the document
phpdig:limit_links   Links of limitation to an host / path
phpdig:text          Highlighted text extract or summary
phpdig:n             Result ranking, starting 1.
phpdig:complete_path Complete URL of the document
phpdig:update_date   Last update of the document
phpdig:filesize      Size of the document (KiloBytes)

Clearings on search

The search form is so simple that it not needs lot of explain. But it could be useful to know that :

- An AND operator is applied between each search key ;
- Putting a '-' sign before a word excludes it from the search results. No document containing this word would be displayed ;
- Search is case-insensitive and accent-insensitive. In the other hand, results highlighting is accent-sensitive.

Getting help about PhpDig

A small messageboard dedicated to PhpDig can be found at : http://phpdig.toiletoine.net/messageboard/
Ask there any questions you have about this script.

You can also mail the author at phpdig@toiletoine.net

File created by XSLT parser Php 4.3.0 - Sablotron 0.96