Last update : 2003-02-21
Note: The author would welcome help with spelling corrections in this version. If you want to contribute, mail the author to obtain an XML version of this document.
Corrections to previous versions of the documentation by Brien Louque.
Address: http://phpdig.toiletoine.net
HTTP Spidering
PhpDig follows links within a web server as any web browser would, to build the list of pages to index.
Links can be in area maps or frames. PhpDig supports redirections. Any syntax of the HREF attribute is followed by PhpDig. Simple JavaScript links such as window.open() or window.location() are followed too.
PhpDig does not leave the root site you define for indexing. The spidering depth is chosen by the user.
All HTML content is listed, both static and dynamic pages. PhpDig checks the MIME type of the document, or tests for the existence of an <HTML> tag at the beginning of it.
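The spidering rules above (follow links, stay inside the root site, stop at a depth chosen by the user) can be sketched as a level-by-level walk. This is only an illustration written in Python, not PhpDig's own PHP code: the `crawl` function is hypothetical, and the in-memory `links` mapping stands in for pages that a real spider would fetch over HTTP.

```python
from urllib.parse import urljoin, urlparse


def crawl(start_url, links, max_depth):
    """Return the URLs reachable from start_url within max_depth levels,
    never leaving the host of start_url.

    `links` maps a page URL to the raw HREF values found on that page;
    it replaces real HTTP fetching in this sketch.
    """
    root_host = urlparse(start_url).netloc
    seen = [start_url]            # pages to index, in discovery order
    current_level = [start_url]
    for _ in range(max_depth):
        next_level = []
        for page in current_level:
            for href in links.get(page, []):
                url = urljoin(page, href)       # resolve relative links
                if urlparse(url).netloc != root_host:
                    continue                    # do not leave the root site
                if url not in seen:
                    seen.append(url)
                    next_level.append(url)
        if not next_level:                      # no new link: stop early
            break
        current_level = next_level
    return seen


site = {
    "http://example.com/": ["a.html", "http://other.example.org/x.html"],
    "http://example.com/a.html": ["/b.html", "a.html"],
    "http://example.com/b.html": ["c.html"],
}
pages = crawl("http://example.com/", site, max_depth=2)
```

With a depth of 2, the walk reaches a.html and b.html but not c.html, and the link to the other host is ignored.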
Full Text Indexing
PhpDig indexes all the words of a document, except small words (less than 3 letters) and common words; the latter are defined in a text file.
Lone numbers are not indexed, but numbers included in words are. Underscores count as part of a word.
The number of occurrences of a word in a document is saved. Words in the title can be given more weight when ranking results.
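The word filtering rules above can be illustrated with a short Python sketch. It is not PhpDig's actual code: the `index_words` function and the sample common-word set are hypothetical, and applying the title weight as a simple multiplier on occurrences is an assumption. The values 2 and 3 are the defaults of SMALL_WORDS_SIZE and TITLE_WEIGHT in config.php.

```python
import re
from collections import Counter

SMALL_WORDS_SIZE = 2   # words of this length or shorter are skipped
TITLE_WEIGHT = 3       # assumed: each title occurrence counts 3 times
# Hypothetical sample; PhpDig reads its common words from a text file.
COMMON_WORDS = {"the", "and", "for"}


def index_words(title, body):
    """Count weighted word occurrences for one document."""
    counts = Counter()
    for text, weight in ((title, TITLE_WEIGHT), (body, 1)):
        # \w+ keeps letters, digits and underscores together in one word
        for word in re.findall(r"\w+", text.lower()):
            if len(word) <= SMALL_WORDS_SIZE:
                continue          # too short
            if word.isdigit():
                continue          # lone number
            if word in COMMON_WORDS:
                continue          # common word
            counts[word] += weight
    return counts


counts = index_words(
    title="PHP search engine",
    body="The engine indexes php4 and my_table for 2003, an 42 of it.",
)
```

Here "php4" and "my_table" are kept as single words, while "2003", "the" and "an" are dropped.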
Other features
PhpDig tries to read a robots.txt file at the server root. It looks for meta robots tags too.
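As an illustration of what honouring robots.txt means for a spider, here is a small Python sketch using the standard library parser. It does not show how PhpDig itself parses the file; the robots.txt content and the "ExampleBot" user agent name are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it
# from the server root before crawling.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Pages outside /private/ may be fetched, pages inside may not.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/data.html")
```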
PhpDig can spider sites served on a port other than the default 80.
Password-protected sites can be indexed by giving the robot a valid username and password.
Be careful! This feature could allow an unauthorized user to read protected information. We recommend creating a specific instance of PhpDig, protected by the same credentials as the restricted site. You also have to create a special account for the robot.
The Last-Modified header value is stored in the database to avoid redundant indexing. The <META> revisit-after tag is taken into account as well.
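How the Last-Modified value and the revisit-after delay might combine can be sketched as follows. This is a guess at a reasonable policy, written in Python for illustration, and not a description of PhpDig's internals: the `needs_reindex` function and its parameters are hypothetical. The 7-day fallback is the LIMIT_DAYS default from config.php.

```python
from datetime import datetime, timedelta

LIMIT_DAYS = 7  # default number of days before a page is reindexed


def needs_reindex(last_indexed, stored_last_modified, current_last_modified,
                  now, revisit_after_days=None):
    """Decide whether a page should be indexed again.

    Assumed policy for this sketch: wait for the page's revisit-after
    delay (or LIMIT_DAYS when the page has none), then reindex only if
    the Last-Modified header differs from the stored one.
    """
    delay = LIMIT_DAYS if revisit_after_days is None else revisit_after_days
    if now - last_indexed < timedelta(days=delay):
        return False                       # visited too recently
    return current_last_modified != stored_last_modified


now = datetime(2003, 2, 21)
old = "Sat, 01 Feb 2003 10:00:00 GMT"
new = "Thu, 20 Feb 2003 08:00:00 GMT"

unchanged = needs_reindex(datetime(2003, 2, 1), old, old, now)
changed = needs_reindex(datetime(2003, 2, 1), old, new, now)
too_soon = needs_reindex(datetime(2003, 2, 20), old, new, now)
short_delay = needs_reindex(datetime(2003, 2, 20), old, new, now,
                            revisit_after_days=1)
```

Only `changed` and `short_delay` come out true: the first because the delay has passed and the header differs, the second because the page asks for a one-day revisit delay.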
If desired, the engine can store the textual content of indexed documents. In this case, relevant extracts of the pages found are displayed in the results page with the search keys highlighted.
Display templates
A simple template system makes it possible to adapt the search and results pages to the look of an existing site. Making a template only consists of inserting a few XML-like tags in an HTML page.
Limits
PhpDig can't perform an exact expression search.
Because the indexing process is time-consuming, the Apache/PHP web server that performs it must not be configured with safe_mode.
This limitation can be worked around by:
- Using remote indexing with a MySQL TCP connection and an FTP connection;
- Launching the indexing process from a shell command. This can be done by a cron task.
Spidering and indexing are a bit slow. On the other hand, search queries are fast, even on a large amount of content.
Prerequisites
PhpDig requires a web server (Apache is my preference) with PHP (module or CGI), and a MySQL database server.
The following configurations were tested :
PHP 4.0.3pl1 cgi, Apache 1.3.14, MySql 3.23.28, Windows 2000 ;
PHP 4.0.5rc1 module, Apache 1.3.20, MySql 3.23.28, Windows 2000 ;
PHP 4.0.5 module, Apache 1.3.20, MySql 3.23.39, Linux with 2.4 kernel ;
Php/4.1.1, Apache/1.3.20 (Win32), Windows 2000 ;
Php/4.1.2, Apache/1.3.23 (Unix) mod_ssl/2.8.7, Linux kernel 2.4.3 ;
Php/4.3.0, Apache/2.0.44 (Unix) OpenSSL/0.9.6g, Linux kernel 2.4.18.
Scripts installation
Unzip the archive in a directory and configure Apache to serve it (this directory is called [PHPDIG_DIR] in the following). The engine does not need a dedicated VirtualHost to operate.
If PhpDig is installed on a Unix server, make the following directories writable by the user the Apache server runs as:
    [PHPDIG_DIR]/text_content
    [PHPDIG_DIR]/include
    [PHPDIG_DIR]/admin/temp
MySql database installation
There are two ways to install the database.
- PHP install script:
In your favorite browser, request the page:
    [PHPDIG_DIR]/include/install.php
and follow the instructions.
This script uses the form data to fill in the fields of the "[PHPDIG_DIR]/include/_connect.php" script and copies it to "[PHPDIG_DIR]/include/connect.php".
- Manual installation:
You have to create the database (you can choose a name other than "phpdig"):
    #mysql mysql
    mysql> CREATE DATABASE phpdig;
    mysql> quit
    #mysql phpdig < [PHPDIG_DIR]/sql/init_db.sql
Verify that all tables are present:
    #mysql phpdig
    mysql> SHOW TABLES;
The database answer must be:
    +------------------+
    | Tables_in_phpdig |
    +------------------+
    | engine           |
    | excludes         |
    | keywords         |
    | sites            |
    | spider           |
    | tempspider       |
    +------------------+
    6 rows in set (0.00 sec)
    mysql>
Once the database has been created, copy the "[PHPDIG_DIR]/include/_connect.php" file to "[PHPDIG_DIR]/include/connect.php" and edit the new file.
Replace the values "<host>", "<user>", "<pass>", and "<database>" with the address of your database server, the username and password used to connect to it (if required), and the name you gave to the PhpDig database.
In a local installation, the values "localhost", "root", and "" are sufficient in most cases.
To verify that the install is complete, open the main page [PHPDIG_DIR]/index.php with your favorite web browser.
The search form must be visible.
Once the install is complete, the engine can work without modifications to the configuration file. The configuration step depends on your needs. Don't forget to change the administration login and password if you use PHP compiled as an Apache dynamic or static module.
Notice: Authentication does not work with CGI PHP. In this case, use an .htaccess file to protect the [PHPDIG_DIR]/admin directory.
All configuration parameters are in the [PHPDIG_DIR]/include/config.php file. Each of them is followed by a comment explaining its purpose.
In the following, all statements are lines of the config.php file.
The values shown are the default values.
Configuring administrator access
Change the following constants. If you don't want the password to appear in clear text, use the Apache authentication functions instead.
    define('PHPDIG_ADM_AUTH','1'); //Activates/deactivates the authentication functions
    define('PHPDIG_ADM_USER','admin'); //Username
    define('PHPDIG_ADM_PASS','admin'); //Password
Configuring robot and engine
Change the following variables and constants.
    define('SPIDER_MAX_LIMIT',12); //max recursion levels in spider
    define('SPIDER_DEFAULT_LIMIT',1); //default value
    define('RESPIDER_LIMIT',4); //recursion limit for update
    define('LIMIT_DAYS',7); //default days before reindexing a page
    define('SMALL_WORDS_SIZE',2); //words to not index
    define('MAX_WORDS_SIZE',30); //max word size
    define('TITLE_WEIGHT',3); //relative title weight
    define('CHUNK_SIZE',2048); //chunk size for regex processing
    define('SUMMARY_LENGTH',500); //length of results summary
    define('TEXT_CONTENT_PATH','text_content/'); //Text content files path
    define('CONTENT_TEXT',1); //Activates/deactivates the storage of text content.
Configuring templates
Change the following variables and constants.
    $phpdig_language = "en"; //GUI language
    $template = "$relative_script_path/templates/phpdig.html"; //Template file path
    define('HIGHLIGHT_BACKGROUND','yellow'); //Highlighting background color
                                             //Only for classic mode
    define('HIGHLIGHT_COLOR','#000000'); //Highlighting text color
                                         //Only for classic mode
    define('LINK_TARGET','_blank'); //Target for result links
    define('WEIGHT_IMGSRC','./tpl_img/weight.gif'); //Bar graph image path
    define('WEIGHT_HEIGHT','5'); //Bar graph height
    define('WEIGHT_WIDTH','50'); //Max bar graph width
    define('SEARCH_PAGE','index.php'); //The name of the search page
    define('SEARCH_DEFAULT_LIMIT',10); //results per page
    define('SEARCH_DEFAULT_MODE','start'); // default search mode (start|exact|any)
FTP configuration (if necessary)
PhpDig doesn't index FTP sites. So why on earth does it need an FTP connection?
Many PhpDig users install it on shared web servers, where PHP is generally configured with safe_mode activated. On such shared hosting, access to the crontab is usually not allowed either.
The solution is another instance of PHP on a distinct server. In my case, a Linux server installed at my home and plugged into a cable connection runs the update process for the demo version of PhpDig.
Your hosting only has to allow you to connect to your MySQL database through TCP/IP.
And what about this famous FTP connection? It sends the textual content of indexed documents to the appropriate directory on the remote server.
If you deactivate the FTP function (for low-bandwidth connections such as a modem), the results page displays only the summary stored in the database instead of an extract of the documents.
The FTP parameters are the following.
    define('FTP_ENABLE',0); //Activate/deactivate the ftp connection
    define('FTP_HOST','<ftp host dir>'); //FTP server name
    define('FTP_PORT',21); //FTP port
    define('FTP_PASV',1); //Use passive mode (PASV), recommended
    define('FTP_PATH','<phpdig root dir>'); //Path of the phpdig directory on server, relative to the ftp rootdir
    define('FTP_TEXT_PATH','text_content'); //Text content directory (default)
    define('FTP_USER','<ftp username>'); //FTP username account
    define('FTP_PASS','<ftp password>'); //FTP password account
Database update
The [PHPDIG_DIR]/sql/update_db_to[version].sql file contains all the SQL instructions required to update your existing install of PhpDig.
Scripts update
Save your configuration files, then simply replace the existing scripts with the new ones.
Index a new host
Open the admin interface with your browser: [PHPDIG_DIR]/admin/index.php. Just fill in the URL field; PhpDig recognizes whether it is a new host or an existing one. You can also specify a path and/or a file, which is the starting point of the robot.
Select the maximum search depth in levels and click on the "Dig This !" button.
A new page opens showing the indexing and spidering process. If a double is displayed, it means that PhpDig has detected that the current document, with a new url, is a duplicate of an existing one in the database.
Each "+" sign means that a new link was detected and will be followed at the next spidering level.
For each level, PhpDig displays the number of new links it has found. If no new link is found, PhpDig stops browsing and displays the list of documents.
You can also launch an indexing process from the shell prompt:
    #php -f [PHPDIG_DIR]/admin/spider.php http://mydomain.com
Update an existing host
From the admin page, you can reach the update interface by choosing a site and clicking on the [update form] button.
A two-part interface appears. On the left side of the screen is the client-side folder structure of the site. The blue arrow displays the "folder" content, in order to reindex the documents individually. The document listing of a folder is on the right side of the screen.
On both sides, the red cross deletes the whole selected branch or file from the engine, including sub-folders when a branch is deleted.
The green check mark reindexes the selected branch or document if it was last indexed more than [$limit_days] days ago. It also searches for new links in documents which have changed.
Index maintenance
Three scripts are used to delete useless data from the PhpDig database. The links are in the admin page.
Clean index deletes index records not linked to any page. Useful if manual deletes are done in the database.
Clean dictionary deletes keywords which are not used by the index. Useful for reducing the size of the dictionary, particularly when a large site containing a great many technical words is deleted from the engine.
Clean common words must be run when new common words are added to the [PHPDIG_DIR]/includes/common_words.txt file. It deletes all references to those common words.
The [PHPDIG_DIR]/admin/spider.php script can be launched by a cron task in order to update the index automatically. The recommended interval is 7 days. Documents whose changes you want to see immediately in searches can be updated manually.
Those pages can also contain a "revisit-after" meta tag with a short delay.
The script accepts two kinds of parameter:
    #php -f [PHPDIG_DIR]/admin/spider.php all
launches a normal update.
The following syntax:
    #php -f [PHPDIG_DIR]/admin/spider.php http://www.mydomain.net/
only indexes or updates the http://www.mydomain.net/ website.
Use this option if you want distinct update delays for each site registered in the engine, by creating one cron task for each of them.
As with any shell command, the output can be redirected to a text file, if you want some logs:
#php -f [PHPDIG_DIR]/admin/spider.php all >> /var/log/phpdig.log
Templates are HTML files containing some XML-like tags which are replaced with the dynamic PhpDig content.
See the source code of the provided templates for examples of how to make a template.
Two CSS classes are used by PhpDig:
.phpdigHighlight : <SPAN/> class for highlighting of search terms.
a.phpdig : <A/> class for phpdig results and navigation links.
All template tags look like: <phpdig:parameter/>.
Except for the <phpdig:results></phpdig:results> tag, all are stand-alone tags.
Tags outside the results table
    phpdig:title_message                    Page title
    phpdig:form_head                        Start of the search form
    phpdig:form_title                       Form title
    phpdig:form_field                       Text field of the form
    phpdig:form_button                      Submit button of the form
    phpdig:form_select                      Select list to choose the number of results per page
    phpdig:form_radio                       Radio buttons to choose the parsing of search keys
    phpdig:form_foot                        End of the search form
    phpdig:result_message                   Number of results message
    phpdig:ignore_message                   Too short words message
    phpdig:ignore_commess                   Too common words message
    phpdig:nav_bar                          Navigation bar to browse results
    phpdig:pages_bar                        Navigation bar without previous/next links
    phpdig:previous_link src='[img src]'    "Previous" icon
    phpdig:next_link src='[img src]'        "Next" icon
Results table tags
    phpdig:results                          Contains the results list
    phpdig:img_tag                          Relevance bar graph
    phpdig:weight                           Relevance of the page (in percent)
    phpdig:page_link                        Result title and link to the document
    phpdig:limit_links                      Links limiting the search to a host / path
    phpdig:text                             Highlighted text extract or summary
    phpdig:n                                Result rank, starting at 1
    phpdig:complete_path                    Complete URL of the document
    phpdig:update_date                      Last update of the document
    phpdig:filesize                         Size of the document (kilobytes)
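The idea of replacing stand-alone template tags with dynamic content can be sketched in a few lines of Python. This is an illustration only, not PhpDig's template engine: the `render` function is hypothetical, and it handles neither tags with attributes (such as phpdig:previous_link) nor the repeating <phpdig:results> block.

```python
import re

# Matches stand-alone tags such as <phpdig:title_message/>
TAG = re.compile(r"<phpdig:(\w+)\s*/>")


def render(template, values):
    """Replace each stand-alone <phpdig:name/> tag with values[name].
    Tags without a value are left untouched in this sketch."""
    def substitute(match):
        return values.get(match.group(1), match.group(0))
    return TAG.sub(substitute, template)


template = "<h1><phpdig:title_message/></h1><p><phpdig:result_message/></p>"
html = render(template, {
    "title_message": "Search results",
    "result_message": "3 results found",
})
```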
The search form is simple enough to need little explanation. But it can be useful to know that:
- An AND operator is applied between the search keys;
- Putting a '-' sign before a word excludes it from the search results. No document containing this word will be displayed;
- Search is case-insensitive and accent-insensitive. On the other hand, results highlighting is accent-sensitive.
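The three rules above can be sketched in Python as follows. This is only an illustration of the described behaviour, not PhpDig's query code: the `fold` and `matches` functions are hypothetical, and whole-word matching is assumed here, whereas PhpDig offers several search modes (start|exact|any).

```python
import unicodedata


def fold(text):
    """Lower-case text and strip accents, e.g. 'Développé' -> 'developpe'."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def matches(query, document_words):
    """True if the document contains every plain key and none of the
    keys prefixed with '-'. Whole-word matching is assumed."""
    words = {fold(w) for w in document_words}
    for key in query.split():
        if key.startswith("-") and len(key) > 1:
            if fold(key[1:]) in words:
                return False        # an excluded word is present
        elif fold(key) not in words:
            return False            # AND: every key must be present
    return True


doc = ["Moteur", "de", "recherche", "développé", "en", "PHP"]
both_found = matches("php developpe", doc)   # case and accents ignored
excluded = matches("php -moteur", doc)       # 'moteur' is in the document
missing = matches("php mysql", doc)          # 'mysql' is not in the document
```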
A small messageboard dedicated to PhpDig can be found at : http://phpdig.toiletoine.net/messageboard/
You can ask any questions you have about this script there.
You can also mail the author at phpdig@toiletoine.net
File created by XSLT parser Php 4.3.0 - Sablotron 0.96