home *** CD-ROM | disk | FTP | other *** search
- Path: sparky!uunet!pipex!bnr.co.uk!uknet!comlab.ox.ac.uk!oxuniv!mjl
- From: mjl@vax.oxford.ac.uk
- Newsgroups: comp.databases
- Subject: Cancer Research database: need advice.
- Message-ID: <1992Dec21.111601.10931@vax.oxford.ac.uk>
- Date: 21 Dec 92 11:16:00 GMT
- Organization: Oxford University VAX 6620
- Lines: 206
-
- I wonder if the experience of the net could offer some opinions
- on our problems with finding a suitable database solution. I give
- below some background and experience so far.
-
- For many years my Group(CCRG) has maintained a large 'database'
- consisting of data on childhood cancer cases in UK, known as the
- National Registry of Childhood Tumours(NRCT). I use '' around
- database because the data is held in a sequential 'binary'
- formatted file on our VMS Vax. Numeric data is held in 4 byte
- integer format, and most character information is held in ASCII
- after ciphering. This character information is stored as
- truncated delimited strings. Thus the space used is efficient
- with no wasted white space.
-
- There are over 50,000 records: each has 160 4 byte integer values
- and 25 strings. The integers cover such data as birth dates,
- region codes and medical diagnostic codes. The strings include
- normal names, addresses, postcodes(separately) etc.
-
- This file is managed by means of several third generation
- language programs which carry out extensive checking and data
- validation. There are probably hundreds of lines of code which
- can be drawn upon depending what is being altered or
- added/deleted. This is far beyond the typical range checks and
- simple rules which I believe can be built into most DBMS systems.
- Many of the 160 variables can be inter-related, and even strings
- have to be checked for validity, eg postcodes are quite complex.
- Checks between the rows(cases) do not currently get routinely
- performed in the code, although we wish to do this in the future
- for checking duplicate entries etc. Currently this is done by a
- combination of interactive searching(see below) and manual
- methods. All the programs work by batch running data files
- containing, often, large numbers of cases( eg 1000), which may be
- existing cases being edited, or new cases. This is the way CCRG
- validates and maintains the integrity of the 'database'. Case
- problems are flagged and dealt with individually, possibly by
- different people, and then the job resubmitted, and only allowed
- onto the datafile when all problems are accounted for. Proof
- reading also plays an important part. The result is a data file
- of great integrity, which could not be maintained by interactive
- _direct_ data entry alone. Much data entry is of course done into
- satellite databases which are then used to create the batch runs,
- but nothing goes into the real datafile until 100% verified.
-
- CCRG wishes to change to the use of a proprietary database to
- replace the entire reliance on 3rd generation language programs,
- and in particular to take advantage of DBMS query facilities. For
- example, we already are maintaining an INGRES database containing
- an exact copy of the contents of the datafile. This allows us to
- test out use of Ingres for queries in parallel with the old
- system. Although we used 3rd generation programs in the past to
- do case selection etc for research, the ability to run simple SQL
- queries on the data, AND do interactive searching is very
- welcome. Currently the data on file is still _managed_ with the
- 3rd generation programs, and data recopied across to INGRES at
- intervals.
-
- This is where we have found problems which seem to be inherent in
- SQL or Ingres, and which make us think that Ingres( or SQL) is
- not the best answer. In fact, I am now forming the opinion that
- using Ingres would be so restrictive that we would be better re-
- writing the 3rd generation programs (which incidentally are
- written in Algol68. Algol68, although laughed off by many as an
- obsolete joke, is in fact a very powerful language on a par with
- structured languages like C, but scorned since it is out of
- fashion).
-
- Some of the problems we found with Ingres are:
-
- 1) The complex type of queries we would often write with ease
- in Algol become totally bogged down using SQL. All the
- advice points to approaching the query a different way. So
- we rewrite our nice Boolean based Algol code creating
- temporary tables all over the place willy-nilly. This seems
- to be hopelessly unwieldy, and it is extremely inefficient.
- SQL seems to be far too weak a tool to replace our 3rd
- generation queries. We have also got incorrect results from
- particular approaches, something which we never experienced
- with Algol68! It had bugs but did not return subtle
- inaccuracies. (see recent Subject 'HELP! SQL Ingres problem'
- in Comp.Databases.Ingres) Our users have to have complete
- confidence in the database accuracy of course.
-
- 2) A user creating a table cannot himself make his table
- accessable to anyone else. The DBA cannot do this either,
- the table must be recreated by the DBA.
-
- 3) On VMS anyway, there is no way of controlling the CPU
- resources used by a batch running SQL job. Our queues have
- CPU limits but the server simply carries on regardless. Thus
- a bogged down user job(of which there are plenty) can sit in
- a queue monopolising it, and the server, for hours.
-
- 4) SQL just does not seem to be designed to be used in a
- research type environment where serious large queries, many
- of which must be batch run, occur
-
- 5) There is a gap, from the point of view of easy report
- writing, between the simplicity of SQL and the complexity of
- writing a Report. A user needs better control over a default
- report layout. This can be worked around, if rather in-
- ellegantly, through ESQL.
-
- Of course, I have not mentioned the obvious disadvantages of the
- vast amount of extra space required by Ingres tables. Whereas
- the datafile took up about 30 Mb; the Ingres version uses 80Mb.
- Some of this is attributable to many string slots which are
- partially empty. The length of one string variable can vary
- between zero and 60 in some cases, with an average of say 25.
- This could be improved by a different breakdown of tables.
-
- My question is this.
-
- Does anyone with experience of other SIMILAR Size databases in a
- research environment have any suggestions for what alternative
- dbms we should look at and why?. This should be in the context of
- our main problems as I see them:
-
- 1 Complex queries are not at all uncommon, see long example
- below. Should be PROVEN for this type of work.
-
- 2 Batch running as well as interactive queries essential
-
- 3 Must integrate with a 3rd generation language for the
- integrity checking, unless someone can suggest a DBMS so
- good it can handle this sort of thing.
-
- Any suggestions should be for _relational_ systems. Although we
- really would prefer something which will run under VMS, I would
- not impose that as a definite restriction if all else was viable.
-
- Many thanks in advance for any suggestions.
-
- Martin Loach
- Childhood Cancer Research Group
- University of Oxford
- England
-
- e-mail MJL@UK.AC.OXFORD.CHILD-CANCER-RESEARCH
-
-
- Below, by way of reference, I give an example complex query
- showing the sort of functionality we need, not that infrequently.
-
- ----------------------------cut here--------------------------
-
- IF
- BOOL flag code ok := v[nhscr flag] < 1;
- BOOL carcinoid app :=
- IF v[med type icdo] /= -1 AND v[med type icdo] % 100 = 824
- AND (v[med site icdo] % 10 = 159 OR v[med site icdo] = 1535)
- THEN TRUE
- ELSE FALSE
- FI;
- BOOL nat :=
- IF v[newdiaganniv] = 141 THEN TRUE
- ELSE FALSE
- FI;
- BOOL ca app or nat := carcinoid app OR nat;
- BOOL is a scots case :=
- IF v[registry] >= 21 AND v[registry] <= 25
- THEN TRUE
- ELIF v[registry] /= -1
- THEN FALSE
- ELIF is a scots hospital(v[trtmt centre])
- THEN TRUE
- ELIF is a scots hospital(v[uk reg hosp])
- THEN TRUE
- ELSE {NOT a Scots Trial or UKCCSG case, so . . . } FALSE
- FI;
- BOOL pre 71 alive not flag 0 := v[anniv year] < 71
- AND v[anniv year] /= -1
- AND v[death year] = -1 AND v[nhscr flag] /= 0;
- BOOL foreign ukccsg only := v[residence] /= -1
- AND v[residence] /= "EWS" AND v[residence] /= "BFPO" ;
- BOOL is 71 to 87 := IF 71 <= v[anniv year]
- AND v[anniv year] <= 87 THEN TRUE
- ELIF v[anniv year] /= -1 THEN FALSE
- ELSE 71 <= v[diag year] AND v[diag year] <= 87
- FI;
- BOOL is 71 to 87 without dc := is 71 to 87
- AND v[dc status] = -1;
- BOOL dc with poor birth date:=
- v[birth day] = -1 AND v[death year] > 69
- AND v[dc status] /= -1 ;
- BOOL pre 71 := IF v[anniv year] /= -1
- AND v[anniv year] < 71 THEN TRUE
- ELIF v[anniv year] >= 71 THEN FALSE
- ELSE { anniv year = -1 so . . . } v[diag year] < 71
- AND v[diag year] /= -1
- FI;
- BOOL pre 71 with 71 on death over 14 := pre 71
- AND v[death year] >= 71
- AND v[age at death] > 14;
- NOT is a scots case AND flag code ok AND NOT ca app or nat
- AND NOT foreign ukccsg only
- AND(is 71 to 87 without dc OR dc with poor birth date
- OR pre 71 with 71 on death over 14
- OR pre 71 alive not flag 0)
- THEN
-
- {list etc}
-
-
- Note- things like 'is a scots hospital' are already defined
- elsewhere for common use.
-