NetNews Usenet Archive 1992 #31

home *** CD-ROM | disk | FTP | other *** search

/ NetNews Usenet Archive 1992 #31 / NN_1992_31.iso / spool / comp / database / 8634 < prev next >

Wrap

Internet Message Format | 1992-12-22 | 9.3 KB

Path: sparky!uunet!pipex!bnr.co.uk!uknet!comlab.ox.ac.uk!oxuniv!mjl From: mjl@vax.oxford.ac.uk Newsgroups: comp.databases Subject: Cancer Research database: need advice. Message-ID: <1992Dec21.111601.10931@vax.oxford.ac.uk> Date: 21 Dec 92 11:16:00 GMT Organization: Oxford University VAX 6620 Lines: 206 I wonder if the experience of the net could offer some opinions on our problems with finding a suitable database solution. I give below some background and experience so far. For many years my Group(CCRG) has maintained a large 'database' consisting of data on childhood cancer cases in UK, known as the National Registry of Childhood Tumours(NRCT). I use '' around database because the data is held in a sequential 'binary' formatted file on our VMS Vax. Numeric data is held in 4 byte integer format, and most character information is held in ASCII after ciphering. This character information is stored as truncated delimited strings. Thus the space used is efficient with no wasted white space. There are over 50,000 records: each has 160 4 byte integer values and 25 strings. The integers cover such data as birth dates, region codes and medical diagnostic codes. The strings include normal names, addresses, postcodes(separately) etc. This file is managed by means of several third generation language programs which carry out extensive checking and data validation. There are probably hundreds of lines of code which can be drawn upon depending what is being altered or added/deleted. This is far beyond the typical range checks and simple rules which I believe can be built into most DBMS systems. Many of the 160 variables can be inter-related, and even strings have to be checked for validity, eg postcodes are quite complex. Checks between the rows(cases) do not currently get routinely performed in the code, although we wish to do this in the future for checking duplicate entries etc. Currently this is done by a combination of interactive searching(see below) and manual methods. All the programs work by batch running data files containing, often, large numbers of cases( eg 1000), which may be existing cases being edited, or new cases. This is the way CCRG validates and maintains the integrity of the 'database'. Case problems are flagged and dealt with individually, possibly by different people, and then the job resubmitted, and only allowed onto the datafile when all problems are accounted for. Proof reading also plays an important part. The result is a data file of great integrity, which could not be maintained by interactive _direct_ data entry alone. Much data entry is of course done into satellite databases which are then used to create the batch runs, but nothing goes into the real datafile until 100% verified. CCRG wishes to change to the use of a proprietary database to replace the entire reliance on 3rd generation language programs, and in particular to take advantage of DBMS query facilities. For example, we already are maintaining an INGRES database containing an exact copy of the contents of the datafile. This allows us to test out use of Ingres for queries in parallel with the old system. Although we used 3rd generation programs in the past to do case selection etc for research, the ability to run simple SQL queries on the data, AND do interactive searching is very welcome. Currently the data on file is still _managed_ with the 3rd generation programs, and data recopied across to INGRES at intervals. This is where we have found problems which seem to be inherent in SQL or Ingres, and which make us think that Ingres( or SQL) is not the best answer. In fact, I am now forming the opinion that using Ingres would be so restrictive that we would be better re- writing the 3rd generation programs (which incidentally are written in Algol68. Algol68, although laughed off by many as an obsolete joke, is in fact a very powerful language on a par with structured languages like C, but scorned since it is out of fashion). Some of the problems we found with Ingres are: 1) The complex type of queries we would often write with ease in Algol become totally bogged down using SQL. All the advice points to approaching the query a different way. So we rewrite our nice Boolean based Algol code creating temporary tables all over the place willy-nilly. This seems to be hopelessly unwieldy, and it is extremely inefficient. SQL seems to be far too weak a tool to replace our 3rd generation queries. We have also got incorrect results from particular approaches, something which we never experienced with Algol68! It had bugs but did not return subtle inaccuracies. (see recent Subject 'HELP! SQL Ingres problem' in Comp.Databases.Ingres) Our users have to have complete confidence in the database accuracy of course. 2) A user creating a table cannot himself make his table accessable to anyone else. The DBA cannot do this either, the table must be recreated by the DBA. 3) On VMS anyway, there is no way of controlling the CPU resources used by a batch running SQL job. Our queues have CPU limits but the server simply carries on regardless. Thus a bogged down user job(of which there are plenty) can sit in a queue monopolising it, and the server, for hours. 4) SQL just does not seem to be designed to be used in a research type environment where serious large queries, many of which must be batch run, occur 5) There is a gap, from the point of view of easy report writing, between the simplicity of SQL and the complexity of writing a Report. A user needs better control over a default report layout. This can be worked around, if rather in- ellegantly, through ESQL. Of course, I have not mentioned the obvious disadvantages of the vast amount of extra space required by Ingres tables. Whereas the datafile took up about 30 Mb; the Ingres version uses 80Mb. Some of this is attributable to many string slots which are partially empty. The length of one string variable can vary between zero and 60 in some cases, with an average of say 25. This could be improved by a different breakdown of tables. My question is this. Does anyone with experience of other SIMILAR Size databases in a research environment have any suggestions for what alternative dbms we should look at and why?. This should be in the context of our main problems as I see them: 1 Complex queries are not at all uncommon, see long example below. Should be PROVEN for this type of work. 2 Batch running as well as interactive queries essential 3 Must integrate with a 3rd generation language for the integrity checking, unless someone can suggest a DBMS so good it can handle this sort of thing. Any suggestions should be for _relational_ systems. Although we really would prefer something which will run under VMS, I would not impose that as a definite restriction if all else was viable. Many thanks in advance for any suggestions. Martin Loach Childhood Cancer Research Group University of Oxford England e-mail MJL@UK.AC.OXFORD.CHILD-CANCER-RESEARCH Below, by way of reference, I give an example complex query showing the sort of functionality we need, not that infrequently. ----------------------------cut here-------------------------- IF BOOL flag code ok := v[nhscr flag] < 1; BOOL carcinoid app := IF v[med type icdo] /= -1 AND v[med type icdo] % 100 = 824 AND (v[med site icdo] % 10 = 159 OR v[med site icdo] = 1535) THEN TRUE ELSE FALSE FI; BOOL nat := IF v[newdiaganniv] = 141 THEN TRUE ELSE FALSE FI; BOOL ca app or nat := carcinoid app OR nat; BOOL is a scots case := IF v[registry] >= 21 AND v[registry] <= 25 THEN TRUE ELIF v[registry] /= -1 THEN FALSE ELIF is a scots hospital(v[trtmt centre]) THEN TRUE ELIF is a scots hospital(v[uk reg hosp]) THEN TRUE ELSE {NOT a Scots Trial or UKCCSG case, so . . . } FALSE FI; BOOL pre 71 alive not flag 0 := v[anniv year] < 71 AND v[anniv year] /= -1 AND v[death year] = -1 AND v[nhscr flag] /= 0; BOOL foreign ukccsg only := v[residence] /= -1 AND v[residence] /= "EWS" AND v[residence] /= "BFPO" ; BOOL is 71 to 87 := IF 71 <= v[anniv year] AND v[anniv year] <= 87 THEN TRUE ELIF v[anniv year] /= -1 THEN FALSE ELSE 71 <= v[diag year] AND v[diag year] <= 87 FI; BOOL is 71 to 87 without dc := is 71 to 87 AND v[dc status] = -1; BOOL dc with poor birth date:= v[birth day] = -1 AND v[death year] > 69 AND v[dc status] /= -1 ; BOOL pre 71 := IF v[anniv year] /= -1 AND v[anniv year] < 71 THEN TRUE ELIF v[anniv year] >= 71 THEN FALSE ELSE { anniv year = -1 so . . . } v[diag year] < 71 AND v[diag year] /= -1 FI; BOOL pre 71 with 71 on death over 14 := pre 71 AND v[death year] >= 71 AND v[age at death] > 14; NOT is a scots case AND flag code ok AND NOT ca app or nat AND NOT foreign ukccsg only AND(is 71 to 87 without dc OR dc with poor birth date OR pre 71 with 71 on death over 14 OR pre 71 alive not flag 0) THEN {list etc} Note- things like 'is a scots hospital' are already defined elsewhere for common use.