(**************** SplitBase Data Management Systems **************** * * * Copyright (c) 2001 Leon O. Romain * * * * leon@kafou.com * * * *******************************************************************) HISTORY I conceived SplitBase about 8 years ago as an alternative to the available Database Management Programs and also to alleviate some of the burden imposed on programmers by Borland's BDE. During that time, I have been involved on projects that either did not require database programming or those for which the clients requested specific systems. I had also hoped that by now the public domain and/or the open-source communities would have been flooded with either good or excellent alternatives to both the BDE and commercial databases. Although there have been many outstanding implementations in that direction, none of them had addressed all the issues I am, to this day, still concerned with. Now, being faced with the prospect of many data management projects in my immediate future, I decided to finally start working on my ideas, and that is how I came up with SplitBase. THE CONCEPTS AND PHILOSOPHY Simplicity, efficiency and speed are the guiding principles behind SplitBase, and the reasons I prefer to program in Delphi. Delphi allows the creation of powerful applications with all their functionality confined in a single small executable file. In many other popular programming environments, a "Hello World" program necessitates a complex multi megabytes sized installation program in order to run in a user's computer. Dynamic link libraries are a wonderful concept that ran out of control. They often create serious conflict or even worse, overwriting them sometimes may cause other programs to crash or behave erratically. I wanted to develop a structure that was simple enough for average programmers to comprehend and modify easily according to their needs. That structure should be efficient, fast and also powerful enough to be used in real world situations. By efficient I mean it should be able to manage substantial amount of data (at least one million records) safely and accurately. Fast refers to the ability to execute available transactions instantaneously (one fifth of a second or less). Efficiency and speed should not rely on the power of today's processors and other sort of equipment, but rather should be achieved on minimal machines such as the original IBM PC XT and AT with all their relative limitations. I believe that with SplitBase I have achieved my goals and I humbly present it to your judgment while hopping that it might be useful to other programmers and that they may improve upon the quality and functionality of SplitBase according to the preceding guidelines. DESIGN In its current incarnation, SplitBase is a multilevel indexed file structure. Simply put, it is composed of abstract containers of index data or keys as they are more commonly referred to. These keys belong to a structure that attaches them to a simple pointer that relates them to the corresponding records or other lower level containers. When a container reaches its full capacity, it splits into two half filled containers. It is very similar to the B-Tree structure used in many Database indexes, but it is simpler to manage and store. When full, the top-level container, that never splits, indicates that the Database has reached its maximum capacity and cannot accommodate any more records. In its present phase, SplitBase consists of two levels of key containers: a top level with a single container and a secondary level with multiple containers. The top level contains a duplicate of the first key of each secondary container. Attached to that key is a pointer to the physical position in the database file of the related secondary level container. When a secondary container splits or when its first datum changed, that information is updated in the top container. Secondary-container keys are attached to pointers to the location of the actual records within the file. The top-level container always resides in random access memory (RAM) when a database file is active. That is whenever it is opened or created. This reduces the number of disk access to only two to locate and retrieve any record. To write new records, disk access is also limited to an average of two or a maximum of three if the top-level container must be updated. At this point, it is important to note that the maximum number of records that can be entered in a two level container system as described above, is equal to the square of the full capacity of a container. In the worst case scenario, when all secondary-level containers are only half full and the top-level container reaches full capacity, that number is reduced by half, thereby fixing the guaranteed minimum number of records that can be entered in the SplitBase system. New data such as containers generated by a split and all new records are always written to the end of the database file and a pointer to their position within that file is attached to the corresponding key. A SplitBase file is therefore a moderately complex linked list of containers and records. In its original state when the file is created, it only contains an empty top container and a header that provides the manager with information about the file structure such as length of records and of fields or the number of deleted records. For the sake of simplicity, I have decided to save all fields in a record as strings that can be easily converted, in most programming languages, to other data type such as numbers, currency, date etc. Data within the containers are automatically sorted in ascending order with a simple insert sort algorithm. This allows for extremely fast binary search of data in RAM. For every search and insertion request, the manager looks up the top container to determine which secondary container holds the position for that specific key. It then uses the corresponding pointer to locate and load that container into RAM. New data are then inserted into their correct position and found data provide the pointers to their actual records for retrieval. The whole system is then updated if necessary with very minimal disk access. The decisive factor in determining container capacity has been available memory in our targeted minimal computer, the XT and AT version of the IBM PC. Both machines have been plagued by the 64k block limit, which was chosen as the maximum physical size of a container. On the other hand since only two containers are present in memory during data manipulation this will allocate for more than enough room in the limited memory of these machines (640k max.) for the records, the database manager, the operating system and other programs. In our original model, we propose a maximum key length of 26 bytes, which is appropriate for the majority of real life situations where keys are usually simple and short such as a social security number, a last name, a phone number, a zip code etc. This choice provides a capacity of 2000 key/pointer combinations yielding to databases of a maximum of 4 million records and a minimum of 2 million records if data was entered sequentially resulting in all half full split indexes by the time the top container reaches its full capacity. If you did not understand anything in what was said in this section, you should not be programming databases anyway. However the functions in SplitBase are few and much simpler than with other databases; hence you should be able to use it without too much difficulty. IMPLEMENTATION SplitBase is built around a small set of Boolean functions. It is composed of 29 structurally identical functions that return true when the function succeeds and false otherwise. Only 20 of those functions are necessary for a programmer to know in order to use SplitBase. Many of these routines are not structural but complementary and were implemented only for the convenience of the programmer and the integrity of the system. All the others are mainly basic but optimized file manipulation routines. All functions are written in simple, object free turbo Pascal syntax with very few exceptions. This simplicity makes it extremely easy to port SplitBase to other programming environments and early versions of turbo Pascal. It also renders the system compatible with current and future releases of Delphi with practically no modifications. It could even be adapted as easily to run in the original Pascal language but I would not recommend anyone to do so simply because Wirth's Pascal did not allow the programmer to access files and memory directly. This would result in a very slow system and would definitely not meet our previously stated standards of speed and efficiency. In this current version, SplitBase is released in two formats. An include file (splinc.pas) to be included in Turbo Pascal or Delphi programs, and a component (Tspl) that can be added to practically all current releases of Delphi and most likely all future implementations unless they contain some very dramatic structural modification in their basic syntax or functionality. INSTALLATION For the include-file version, just add the compiler directive {$I Splinc.pas} at the top of your Turbo Pascal programs or at the beginning of the implementation section of your Delphi projects. For the component version, install the file spl.pas as a new component according to the methods available in your version of Delphi. You can install it in an existing package or a new one preferably Tspl. At any rate, if you follow the right steps you should end up with a color logo of SplitBase in the "Samples" palette of your Delphi environment. Drag and drop that logo into your forms to use the Spl component that will provide you with all the functionality of SplitBase. The file Splinc.pas is located in the "\Splbase\Include" sub directory of the original release of SplitBase. The component Spl.pas is located in the "\Splbase\Class" sub directory of the same release of SplitBase. PROGRAMMING SplitBase is very simple to program. The following is a brief description of the necessary functions as well as basic principles that should be observed to safely and easily implement these routines. It is the programmer's responsibility to allocate memory space for the system by using InitSpl and ReleaseSpl at the beginning and end of your program. You must also initialize system variables by using InitBase before creating or opening any database file. You have to ensure that your function calls are successful before proceeding with any database session. This can simply be achieved by checking the return value of all your calls. By following the preceding principles you should enjoy a trouble free experience while programming and using SplitBase. Here are the main functions: InitSpl: Initializes the Splitbase system by allocating memory to accommodate the top and secondary system containers. Initspl must be called at the beginning of your programs. If it fails, all other calls to the system must be cancelled. ReleaseSpl: Releases allocated memory back to the operating system. This function must be called at the end of your programs. Failure to do so will prevent other programs to use that portion of memory reserved for SplitBase. InitBase: Initializes important variables used by Splitbase. Initbase must be called before creating or opening SplitBase files. Failure to do so may generate unpredictable errors or otherwise compromise the stability of the system. SetSpl: This is a very important function that you must use to tell SplitBase the size of each field in a new SplitBase file. It must be called before creating a new database and uses a string as parameter. This string contains the size of all fields that define the record structure. Each field size is described with three characters starting with the first field in the record. Fields which sizes are less than 100 should be filled with zeroes on the left of its allocated portion. For example, a field of length 3 followed by a field of length 12 should be written as follows: '003012'. CreateSpl: This function receives a string parameter to create a new SplitBase file. The parameter is the name of the file to be created without any extension. The system will add the proper extension to it, currently (.spd), and create the file if possible. The functions InitBase and SetSpl must be called to set the necessary variables prior to calling CreateSpl. OpenSpl: This function requires a string parameter to open a SplitBase file. The parameter is the name of the file to be opened without any extension. The system will add the proper extension to the name, currently (.spd), and open the file if possible. A call to the InitBase function must be made to set all the necessary variables prior to calling OpenSpl. AddField: This function uses two parameters, a string and an integer, to insert a field into the record structure of the SplitBase system. The string contains the value of the field and the integer its index. All fields must be inserted before saving a record into the file. GetField: This function uses an integer parameter to retrieve a field from the record structure of SplitBase. The parameter contains the index of the field. A record must be retrieved from the data file before using this function. Getfield must be called once for each field in the current record. If the function fails, SplitBase returns an empty string. AddRec: This function saves a record into the SplitBase file. It receives the key attached to that record as a string parameter. All fields must have been inserted into the record structure prior to calling AddRec. GetRec: This function retrieves a record from the data file. It receives the key attached to that record as a string parameter. If the record is found, all fields are loaded into the record structure of SplitBase. DelRec: This function deletes the current record from the data file. A record must be active in memory in order to use DelRec. DelRec has no parameters. ModRec: This function updates the current record in the data file. It actually deletes that record from the file then adds the modified record immediately. It uses a string parameter as key. FirstRec: Locates and retrieves the first indexed record. LastRec: Locates and retrieves the last indexed record. NextRec: Locates and retrieves the next indexed record. PrevRec: Locates and retrieves the previous indexed record. RecCount: Returns the total number of records in the data file. ActiveDb: Returns true if a data file is active or opened. ActiveRec: Returns true if a record is active or loaded. DBEmpty: Returns true if the current data file is empty. Many variables are important in the use of SplitBase. Most of them should never be manipulated directly by the programmer. However, there are a few such as Splerr that the programmer should check regularly. Splerr returns the last error number and description. These values are returned with the following calls: splerr.recnum and splerr.recstr. In the component version the properties ErrorNumber and ErrorString must be accessed in lieu of splerr. The variable CurDtb or its equivalent CurrentDB contains the name of the current data file. The allrec variable contains a lot of valuable information such as the number of fields in the current SplitBase record. You may read it with allrec.size or with FieldCount from the SplitBase component. Finally limrec is a number you can set to limit the number of record that may be entered into the data file. For the component, use Reclimit instead. RELATED PROGRAMS AND TESTS The original release of SplitBase comes with two test programs Split and Split2. These two programs are identical in looks and functionality except that the former uses the include file and the other the component. These programs allow a user to test the major functions of SplitBase. They contain buttons to create new data file and open existing ones. Other buttons also allocate for adding new records as well as finding and deleting existing data within a splitbase file. They also provide navigation buttons to locate the first, last, next and previous records. Finally a generate button will automatically insert 1 million records into a specific data file. That file is made of a two fields record and is called test.spd. It is automatically created and opened if you click the related buttons in the program. Generate will automatically add 1 million even numbers from 2 to 2 million to that file. For fun and to simulate data entry overhead, generate will convert all those numbers into their English spelling and put the actual number in the first field and its spelling in the second before saving the record in the data file. These two fields are also output in two edit fields for added overhead. An algorithm is used that forces SplitBase to update the top container every four records that in real life applications should only happen on average every thousand records. Test on a 667 megahertz custom built computer equipped with a 5400 RPM Ultra ATA/66 hard drive yield a result of 139 records per second at completion of generate. The function can be stopped at any time by clicking on the 'Stop' button. FUTURE EXPLORATIONS Where do we go from here? Well from the basic functionality of the system many routes may be considered. One of the more obvious may be to increase the number of records that the system can handle. The easiest way to do this might be to hold the keys in the top and secondary containers into an ANSI string just like 'rechld' that holds the fields in any given record. The two-gigabyte theoretical limit claimed by Borland is well beyond anything we might need in any real world situations. The LongInt and Int64 Problem The only other hurdle is the 2-gigabyte limit of the longint type that is used to determine file sizes and to locate records and containers within the data file. This problem may be addressed by using a blocking factor greater than 1 for creating and opening data files. Two hundred fifty-six (256) and one thousand twenty four (1024) may be good values to experiment with. The only problem might be in synchronizing container and record size with the chosen blocking factor. It would be easy to add fillers like in the good old days when Cobol was king. This will certainly add to the waste in a data file. But who cares in times when 80-gigabytes hard drives cost less than 200 dollars in most major computer stores. An easier way would be to use the new int64 type that appeared with version 4 of Delphi. But Borland is mute about using it with the reset, rewrite and seek procedures. Int64 can hold up an integer the size of 2 to the power 63. What about garbage collection? Much like in other databases, deleted records and containers are not actually erased from the file. They just sit there useless with no actual pointers relating to them. Heavy usage of delete transactions tends to increase considerably the amount of garbage within the data file. A garbage collection or rebuild routine can easily be implemented by using a new data file and add all the live records from the previous file to it. Multiple indexes data files may be more easily implemented by saving the indexes and the records in different files. It will actually allocate for more records if longint is used. This is also the case if one would like to emulate the functionality of relational databases. The straightforward structure of the data file without the containers will make it easy to emulate most general functions of a relational database system. In the case of a multi-user environment and client/server implementation, a separate manager should be written to handle calls from the different users. The most important issue in multi-user system is probably the ability to lock records. The manager can easily implement this by adding an extra field to records in data files that holds the status of that record. A simple protocol should also be implemented between the manager and the client programs to properly handle transactions. These were a few suggestions for programmers to modify the SplitBase system to easily achieve their goals. However, we believe that SplitBase is very useful as is and can be adapted to solve a great deal of real life problems. Particularly in the retail industry Splitbase may be easily and effectively used without modification to help manage most problems such as stock, accounting, customers etc... How many times did the clerks used complex data analysis to locate your record or a specific item in the store? They usually type a single key string in order to locate the needed information. Most stores even the major chains hold less than 2 million items (I know I usually go around and count them :-). Many cities in America and around the world have less than two millions citizens, most libraries hold less than two million books and the list can go on and on about real life situation in which SplitBase may be used without modifications. If you come up with important additions to SplitBase you are free to publish them along with the original distribution of SplitBase with all files present and unchanged. I would recommend that you follow the guidelines stated at the beginning of this file. However, if you have a REALLY important addition that uses some arcane procedures or an extremely complex routine, well go ahead, publish it with the restrictions stated above as far as keeping original files intact. COPYRIGHT AND LICENSES SplitBase is not based on any previous work other than basic computer Science and Data Structure principles. It is not a public domain material but a copyrighted publication of Leon O. Romain. However, it is freeware and is being released under the GNU license agreement. See the file licence.txt for more information. You may use the SplitBase system as you see fit in your personal, professional or commercial programs without paying any loyalty to the author providing you relieve him of any and all liabilities that may result from the use of SplitBase. In fact SplitBase is released as is. The author decline all responsibility and liability from damages either direct or incidental that may result from the use of SplitBase and its accompanying routines and other software. KNOWN BUGS There are no known bugs. If any are found, they should be easily corrected due to the simplicity of SplitBase and its modular structure. ADDENDUM I checked the int64 type and it worked perfectly with the reset and seek functions on files much bigger than 4 gigabytes. This yields the possibility of creating SplitBase systems capable of accessing billions of records with very little modification to the original program when using Delphi version 4 or later. By changing the type of the necessary variables from longint to int64 and by modifying the index field of the splitbox record to work as an ansistring instead of an array is all that it takes for SplitBase to access those billions of records. I also added an activex version of SplitBase to the original release. It works fine except for the fact that the icon does not appear on the form when selected. The necessary files are located in the activex subdirectory and the file SplBaseXControl1.ocx must be registered before using the control. This opens the door for visual basic, c++ and other programming environments to use SplitBase before the advent of native implementations of the system. OCX controls may be registered using the Regsvr32.exe utility usually located in the windows or system directories. COMMUNICATION For any comments, suggestions, corrections, criticism, bug reports and other communications please Email me at leon@kafou.com. SplitBase and the SplitBase logo are trademarks of Leon O. Romain. The SplitBase Data Management Systems and this user guide are copyrighted materials of Leon O. Romain. Copyright (c) 2001 Leon O. Romain. / \ / \ S | \ / | | \ / | P L / \ | / \ / \ | / \ I T | \ / | \ / | | \ / | \ / | B A S E \ | / \ | / \ | / \ | / DATA MAMAGEMENT SYSTEMS -----------------------