- Path: sparky!uunet!spool.mu.edu!agate!biosci!lhc!object!ostell
- From: ostell@object.nlm.nih.gov (Jim Ostell)
- Newsgroups: bionet.software
- Subject: Beyond GenBank
- Message-ID: <1992Dec22.133801.20561@nlm.nih.gov>
- Date: 22 Dec 92 13:38:01 GMT
- Sender: news@nlm.nih.gov
- Organization: National Library of Medicine
- Lines: 277
- X-Newsreader: Tin 1.1 PL4
-
-
- NCBI is aware of many of the issues raised in this
- discussion. Rather than being "afraid to face these tough issues",
- we have been moving toward possible solutions
- internally at NCBI for the past two years, as many people who have
- offered to participate constructively in the process are aware. In
- addition, you all must realize that
- NCBI was given authority over GenBank only this October. It will
- take more than a couple of months to make substantial progress.
-
- Further, GenBank is an existing international collaboration, so
- any changes need to be acceptable to the collaborative groups, or
- invisible to them. Finally, there is a very large contingent of users
- who may be less cognizant of the problems with GenBank today, who are
- very invested in keeping what is familiar unchanged. Their needs must be
- addressed by any strategy taken by NCBI as well.
-
- Rather than quoting individual comments, now that a great deal of
- discussion has transpired, I would like to focus on issues, comment on
- what we see the problems to be, then present our plan for addressing the
- issue.
-
- Requirements for Creating New Databases:
-
- A central theme in the NCBI strategic plan is support for databases
- built by outside domain experts, yet incorporated into a unified user
- view in the central sequence databases. In order to accomplish this there
- must be:
-
- 1) a standard computer readable data exchange language. It must be
- richer and more flexible than the flatfile format, formally correct as a
- language, yet not force a particular hardware platform, programming
- language, or database technology on the scientific community.
-
- 2) a data specification which rigorously defines core objects (such
- as sequences, maps, coding regions) yet allows both the addition of
- custom extensions to existing defined objects and the creation of
- totally new objects. A migration path must exist for moving the user
- defined objects to the core standard set as certain definitions prove to
- be of widespread utility by dint of experience.
-
- 3) stable identifiers for sequences must be supported by central
- databases. The somewhat casual relationship of LOCUS and ACCESSION with
- a particular sequence is intolerable if other investigators build
- databases which cite locations on these sequences. The ability to
- stably cite features is even more complex, see discussion below.
-
- 4) the data model must allow incorporation of data of various types
- from different sources. The same data must be able to participate in
- different views of the database (e.g. a "typical" beta globin region vs.
- all original pieces of sequence containing a beta globin coding region
- in the database).
-
- NCBI Approach to Problem 1:
-
- We have chosen a data exchange language called ASN.1, Abstract
- Syntax Notation 1. It is an International Organization for Standardization
- standard (ISO 8824, 8825) for exchange of structured data in a formal,
- yet machine and implementation independent way. This is not another ad
- hoc file format invented for a special purpose by biologists. It
- separates the definition of the data structure from any particular block
- of data. This means that the specification is necessary and sufficient
- to describe data conforming to it from any source. The specification is
- not a passive documentation of a file format, but is used by software to
- actively check a data stream for accuracy. Anyone who has been parsing
- flatfiles will appreciate the value of full, automatic data checking by
- machine.
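
The idea of a specification that software uses to check data, rather than a format described only in prose, can be illustrated with a minimal Python sketch. This is not ASN.1 and not the NCBI toolkit; the spec table `CITATION_SPEC`, the field names, and the function `check_record` are all hypothetical, chosen only to show a definition kept separate from the data and applied to it mechanically.

```python
# Hypothetical, simplified specification: field name -> (expected type, required?).
# This is NOT the NCBI ASN.1 specification, only an illustration of the idea
# that the definition lives apart from any particular block of data.
CITATION_SPEC = {
    "title": (str, True),
    "year": (int, True),
    "authors": (list, False),
}


def check_record(record, spec):
    """Return a list of problems found when checking record against spec."""
    problems = []
    # Every field named in the spec must be present (if required) and well typed.
    for name, (expected_type, required) in spec.items():
        if name not in record:
            if required:
                problems.append(f"missing required field: {name}")
        elif not isinstance(record[name], expected_type):
            problems.append(f"wrong type for field: {name}")
    # Fields the spec does not know about are reported as well.
    for name in record:
        if name not in spec:
            problems.append(f"unknown field: {name}")
    return problems


good = {"title": "Beyond GenBank", "year": 1992}
bad = {"title": "Beyond GenBank", "year": "1992", "pages": 3}

print(check_record(good, CITATION_SPEC))
print(check_record(bad, CITATION_SPEC))
```

The same `check_record` function works for any spec table passed to it, which is the point of the separation: the checker is written once and the definitions vary.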
-
- ASN.1 supports modular specification. That is, one may have a
- module specifying bibliographic entities. This module can then be
- simply referenced by other modules, such as a sequence module or a
- MEDLINE module, rather than coming up with a new bibliographic component
- for every new database. Like modular programming, modular data
- specification has profound benefits for data and code reusability and
- maintainability. The modular design also greatly facilitates linkage
- between databases because they may differ in overall content, but share
- certain defined entities such as literature citations or sequence
- identifiers, which will be compatible with each other, and thus provide
- an avenue for automatic linkage of the other data elements.
-
- In order to facilitate use of ASN.1, NCBI provides software tools
- for developing specifications, validating them, automatically generating
- parsers for any specification, and tools for reading and writing ASN.1
- structured data that run on 14 different hardware and software
- platforms. See below for tool availability.
-
- NCBI Approach to Problem 2:
-
- We have done specifications in ASN.1 for biological sequences,
- including nucleic acids, proteins, and maps of various types. We have
- an extensive specification for bibliographic information, including
- articles, journals, books, theses, manuscripts, patents, etc., which
- conforms to the ANSI standard for bibliographic citations. We have a
- specification for MEDLINE. We have specifications for a variety of
- features, for alignments of sequences, and for graphs of sequence
- properties.
-
- The specification has been tested by mapping all of GenBank, EMBL,
- DDBJ, SWISSPROT, PIR, and PRF into ASN.1 conforming to the spec. We
- have also mapped MEDLINE, and the sequences from the Brookhaven
- structural database, among other things. Thus the specification is a
- superset and a unification of most major existing sources of sequences and
- their annotations. Much of this has been appearing on the Entrez disks
- for some time. More will appear over the year.
-
- In addition to integrating the sequence databases themselves into a
- single entity, we have also been addressing the issue of contributed
- information ABOUT the sequences. We worked with Philipp Bucher, author
- of the Eukaryotic Promoter Database (EPD) on the TxInit (transcription
- initiation) feature definition. He produces EPD as an ASN.1 formatted
- feature table on every release of EPD.
-
- The ASN.1 specification allows a sequence to have multiple feature
- tables on the same entry, with attribution to the source. So we will be
- adding EPD information automatically to the sequence data appearing in
- our ASN.1 releases in the near future. This allows a very rich
- annotation to be provided by a specialist on their own local system, but
- to be automatically presented in a user view as an integrated part of
- the database. It is our plan to expand this aspect in a big way once we
- have stabilized the sequence data itself (remember we have been
- responsible for GenBank for only a couple of months).
-
- The ASN.1 spec supports a "User-defined Object". This allows the
- attachment of structured data defined by the user either to an existing
- feature as an extension (e.g. a CdRegion with an extension carrying more
- information about the translation process), or as a completely new feature
- type when something is so new it is more than an extension to an
- existing type. User-defined types are transparent in ASN.1, yet support
- a completely structured datatype that user code can operate on. It
- provides an unmoderated forum for new ideas, which code can ignore or
- take advantage of. If a user defined type becomes popular or important,
- then there already exists a definition for it and possible pre-existing
- data in the database which could be reliably converted to a new standard
- type.
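
A minimal Python sketch of how code might "ignore or take advantage of" such extensions follows. The class names `UserObject` and `Feature`, the label "translation-info", and the field names are assumptions made for illustration; they are not taken from the NCBI specification.

```python
from dataclasses import dataclass, field


@dataclass
class UserObject:
    """Hypothetical stand-in for a user-defined object: a label plus structured data."""
    type_name: str
    data: dict = field(default_factory=dict)


@dataclass
class Feature:
    """Hypothetical feature with an open-ended list of user-defined extensions."""
    kind: str            # e.g. "CdRegion"
    location: tuple      # (start, stop), an assumed representation
    extensions: list = field(default_factory=list)


def translation_notes(feature):
    """Use extensions labelled "translation-info"; silently skip all others."""
    return [ext.data for ext in feature.extensions
            if ext.type_name == "translation-info"]


cds = Feature(
    kind="CdRegion",
    location=(100, 400),
    extensions=[
        UserObject("translation-info", {"frameshift_at": 250}),
        UserObject("lab-notebook", {"page": 17}),  # unknown to this code, ignored
    ],
)

print(translation_notes(cds))
```

Code that knows nothing about either extension can still read `kind` and `location`, which is what lets new ideas circulate without breaking existing software.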
-
- NCBI Approach to Problem 3:
-
- In order for outside scientists to cite a sequence location and then
- compare it with other data at a later time, the database must provide
- stable identifiers for sequences. It must be understood that GenBank
- itself is an international collaboration, and, in addition, we are
- adding other sequence data not traditionally part of GenBank such as
- proteins. This means NCBI cannot simply stabilize sequence ids by fiat.
-
- However, we are building a database called ID, whose job it is to
- impose stable IDs, called GI (GenInfo) numbers. A GI is an arbitrary
- unsigned integer which identifies a specific sequence. If anything in
- the sequence changes (a 1 bp change is enough) it is assigned a new GI.
-
- Bioseqs, in the ASN.1 definition, can have multiple ids. So, an
- entry that comes from EMBL, say X12345, would enter ID the first time
- and be assigned a GI, say 10. In the ASN.1 form of that Bioseq, it
- would have both ids, EMBL X12345, and GI 10. All feature locations,
- etc., would be converted from an EMBL id to GI 10 on input. Then
- suppose ID gets an update, which is only to the feature table of X12345.
- ID looks up the old X12345, compares the old sequence to the new, and,
- since they are identical, gives the new entry GI 10 again. Now, suppose
- the sequence for X12345 is changed, but the accession stays the same.
- When ID compares the new sequence with the old, it sees it changed, so
- it assigns a new GI, say 15. It also adds a history to the ASN.1 form
- of the entry. GI 15 gets a pointer saying that it used to be GI 10. GI
- 10 gets a pointer that says it has been replaced by GI 15. A release of
- Entrez made now would have only the GI 15 entry. ID would still
- contain both GI 10 and GI 15 however.
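
The assignment rule just described can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the ID database itself: the class `IdDatabase`, its method `submit`, and the dictionary layout are hypothetical, the sequences are invented, and GIs here are simply handed out in increasing order rather than matching the numbers 10 and 15 used in the text.

```python
class IdDatabase:
    """Toy model of GI assignment; all names here are hypothetical."""

    def __init__(self):
        self.next_gi = 1
        self.entries = {}   # gi -> entry dict; old entries are never deleted
        self.latest = {}    # accession -> gi of the current entry

    def submit(self, accession, sequence, features=None):
        """Store an entry and return its GI, reusing the GI if the sequence is unchanged."""
        old_gi = self.latest.get(accession)
        if old_gi is not None and self.entries[old_gi]["sequence"] == sequence:
            # Annotation-only update: same sequence, so the same GI is kept.
            self.entries[old_gi]["features"] = list(features or [])
            return old_gi
        # New accession, or the sequence changed: assign a fresh GI.
        gi = self.next_gi
        self.next_gi += 1
        self.entries[gi] = {
            "accession": accession,
            "sequence": sequence,
            "features": list(features or []),
            "replaces": old_gi,       # history pointer back to the old GI
            "replaced_by": None,
        }
        if old_gi is not None:
            # History pointer forward from the old GI to the new one.
            self.entries[old_gi]["replaced_by"] = gi
        self.latest[accession] = gi
        return gi


db = IdDatabase()
first = db.submit("X12345", "ACGTACGT")
same = db.submit("X12345", "ACGTACGT", features=["promoter"])
changed = db.submit("X12345", "ACGTACGA")  # a single base differs
print(first, same, changed)
```

Comparing full sequences is the simplest way to express the rule that even a 1 bp change yields a new GI; a real system would presumably use something cheaper than a string comparison, but that is outside what the text states.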
-
- A feature submitted that cited GI 10 could reliably be integrated
- into a release on the fly. When GI 10 is replaced by GI 15, the
- contributor of the feature citing GI 10 could be notified that their
- entry may be invalid and to look at GI 15. They could confirm that
- their annotation in fact still applies to GI 15 and resubmit.
-
- When you retrieve from ID by accession X12345, you get the
- latest entry with that accession, GI 15. However, if you retrieve with
- GI 10, you get the original GI 10 entry, plus the additional information
- that it now has been replaced by GI 15. Thus ID can provide a data
- system which can both operate on the old unstable id system and
- impose a new, parallel stable id system.
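
The two retrieval paths can be sketched as follows, using the X12345, GI 10, and GI 15 example from the text. The `ENTRIES` table, the function names, and the sequence strings are hypothetical placeholders, not the actual ID interface or real data.

```python
# Hypothetical contents of ID after the update described in the text.
# The sequence strings are invented placeholders.
ENTRIES = {
    10: {"accession": "X12345", "sequence": "ACGTACGT",
         "replaces": None, "replaced_by": 15},
    15: {"accession": "X12345", "sequence": "ACGTACGA",
         "replaces": 10, "replaced_by": None},
}


def fetch_by_gi(gi):
    """Stable lookup: always returns exactly the entry that was assigned this GI."""
    return ENTRIES[gi]


def fetch_by_accession(accession):
    """Traditional lookup: returns (gi, entry) for the latest entry with this accession."""
    for gi, entry in ENTRIES.items():
        if entry["accession"] == accession and entry["replaced_by"] is None:
            return gi, entry
    raise KeyError(accession)


gi, entry = fetch_by_accession("X12345")
print(gi)                                # the current entry
print(fetch_by_gi(10)["replaced_by"])    # the old entry still exists and points forward
```

A contributor whose feature cites GI 10 can follow `replaced_by` to find the entry to re-check, while users who know only the accession see no change in behaviour.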
-
- NCBI Approach to Problem 4:
-
- In addition to features, the ASN.1 specification supports alignments
- and sequences constructed by assembly of other sequences. This allows
- submission by outside sources of sequence merge information, placement
- of sequences on genetic and physical maps, assemblies of published
- sequences under a "prototypical" representative, and so on. These
- constructs can be used both to provide new insights on sequence
- relationships as well as for an author to provide a history of changes
- to a sequence as it is updated or added to. We are doing a prototype
- project of this sort of thing with Kenn Rudd's E. coli database and
- with Elvin Kabat's Proteins of Immunological Interest. Already in the
- data released in Entrez, we have assembled segmented sequences from
- GenBank into such higher level entities with pointers to their
- components.
-
- Another type of assembly allowed by the ASN.1 specification is the
- grouping of related sequences together. In the current Entrez releases
- we have been grouping nucleic acid sequences and their translated
- protein products together. We will soon begin to build composites where
- the translated protein is replaced by the fully annotated SWISSPROT or
- PIR entry in the integrated view.
-
- This type of assembly by pointer allows construction of a variety of
- higher level views, including the history of an individual sequence,
- merges of sequences, mixed assemblies of partially mapped and partially
- sequenced chromosome regions, collections and alignments of related
- sequences, and so on. Thus, any given sequence could participate in a
- variety of higher level views, and a user can find the view most
- appropriate to their purposes. While this solves a number of problems,
- of course it raises new ones, in particular, how to present a number of
- views for the selection of the most appropriate one.
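
Assembly by pointer can be illustrated with a short Python sketch in which a higher level view holds no residues of its own, only references to intervals on stored sequences. The ids, residues, and the `(id, start, stop)` tuple layout are assumptions for illustration and do not reproduce the ASN.1 definitions.

```python
# Hypothetical store of component sequences keyed by a stable id.
SEQUENCES = {
    1: "ATGGCC",
    2: "TTTAAAGGG",
}


def resolve(segments, store):
    """Build the residues of a view from (id, start, stop) pointers, 0-based, stop exclusive."""
    return "".join(store[seq_id][start:stop] for seq_id, start, stop in segments)


# Two different views that share the same underlying component data.
merged_view = [(1, 0, 6), (2, 0, 3)]   # all of sequence 1 followed by the start of sequence 2
partial_view = [(2, 3, 9)]             # only the tail of sequence 2

print(resolve(merged_view, SEQUENCES))
print(resolve(partial_view, SEQUENCES))
```

Because views store pointers rather than copies, the same component can take part in any number of views, and this arrangement relies on the stable ids described earlier.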
-
- Consequences of the Design:
-
- This system allows distributed participation. The key to
- participating is to offer data in machine readable form. Corrections to
- splice junctions, additional information on oncogenes, sequence merges,
- integration with maps, can all be submissions to the database. To facilitate
- this process, NCBI has a visitors program which potential collaborators can
- apply for. This allows us to pay for visitors to travel to NCBI, for
- those who do not live in our backyard. Among those who have either paid
- their own way or taken advantage of the visitor program are Amos
- Bairoch, Elvin Kabat, Mitch Sogin (organizing a group of taxonomists to
- help with organisms), Philipp Bucher, Richard Durbin, and many others.
- Anyone with an interest in actually building and contributing sequence
- related databases to the public, please contact me to talk it over. As
- Tom Schneider knows, we have an open invitation to any who wish to
- discuss the future of sequence related databases - all they need do
- is take us up on the offer. Tom... are you listening?
-
- How far along is this? Well, the ASN.1 specification and software
- tools to map GenBank, EMBL, DDBJ, SWISSPROT, PIR, PRF, PDB, EPD, and
- MEDLINE into a single world of data exist, are tested, and have been
- available for ftp for some time on multiple platforms. You can log in
- as anonymous on ncbi.nlm.nih.gov,
- cd toolbox/ncbi_tools
- bin
- get ncbi.tar.Z
- quit
-
- This has all the versions, plus (old) postscript documentation. In
- the \asn directory (after you uncompress and untar) you can print *.asn
- to see the full spec. There are demo programs in \demo. The specification has been
- frozen for a year and has a defined modification cycle. In six months,
- suggestions for improvements are entertained. In nine months, the new
- spec is published with sample files. In twelve months the data is
- released conforming to the new spec. In ASN.1 it is generally possible
- to make new specs backward compatible with old specs. The specification
- and software tools have been in use for over a year to process and
- integrate data from the above sources and to produce releases of Entrez
- and GenBank flatfile (a subset of the ASN.1 data).
-
- The ID database is built and is undergoing testing now. With any
- luck we will use it to produce the next release of Entrez. With ID
- built, we can begin the replacement of translated proteins with
- annotated protein database entries, integration of contributed features
- such as EPD, and processing of high level views such as EcoSeq. This
- will begin to appear in the Entrez releases over the next twelve months.
- Obviously, it will also provide the stable sequence IDs that can be used
- by other contributors as well.
-
- We are starting new initiatives to provide stable linkage and
- nomenclature for outside vocabularies as well. We are working with ATCC
- to link Venter's EST sequences with the ATCC numbers for the clones.
- Mitch Sogin is convening a group of taxonomists to try to rationalize
- the taxonomy and organism nomenclature. Specific hooks have been placed
- in the Gene-ref and Prot-ref features to allow links to genetic and
- protein property databases.
-
- Again, NCBI repeats the invitation. If you are willing to do some
- work to move any aspect of this process along, please contact us at
- info@ncbi.nlm.nih.gov. We will be happy to discuss with you how any
- work you have already done might be integrated into the new efforts or
- how projects you plan might be made compatible. For our part, we
- encourage as wide a participation as possible. There is lots to do and
- we welcome any and all to join us...
-
- Jim Ostell
- NCBI
-