- Newsgroups: bionet.software
- Path: sparky!uunet!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!gatech!udel!darwin.sura.net!welchgate.welch.jhu.edu!danj
- From: danj@welchgate.welch.jhu.edu (Dan Jacobson)
- Subject: Re: intron/exon borders
- Message-ID: <1992Dec30.220952.25847@welchgate.welch.jhu.edu>
- Organization: Johns Hopkins Univ. Welch Medical Library
- References: <30DEC199214434286@aardvark.ucs.uoknor.edu>
- Date: Wed, 30 Dec 1992 22:09:52 GMT
- Lines: 600
-
- In article <30DEC199214434286@aardvark.ucs.uoknor.edu> bfrank@aardvark.ucs.uoknor.edu (FRANK,BART) writes:
- >Can anyone recommend a good program to screen human genomic sequences
- >and predict positions of intron/exon borders?
- >
- >Thanks,
- >Bart Frank
- >Internet: BFRANK@AARDVARK.UCS.UOKNOR.EDU
-
- There are three mail servers which do this type of thing, namely
- GRAIL, GENEID, and GENMARK. I am including information about these
- servers below.
-
- Happy Holidays,
-
- Dan Jacobson
-
- danj@welchgate.welch.jhu.edu
-
-
-
- =========================================================================
- ////////////////////////////////////////////////////////////////////////
- =========================================================================
-
-
-
- Welcome to GRAIL (Gene Recognition and Analysis Internet Link)
-
- Grail is an interface to a system which will ultimately provide
- automated gene assembly from DNA sequence data. Currently the
- system provides analysis of protein coding potential of a DNA
- sequence. The coding recognition module (CRM) uses a
- multiple-sensor neural network approach to identify coding exons
- that are at least 100 bases long. In its current configuration the CRM
- identifies 90% of such regions with less than 1 false positive
- coding exon per 5 coding exons indicated. Your success rate will
- depend on a number of parameters including the G/C content of
- your sequence. In general, coding regions in sequences of low
- G/C content are not as well recognized as those in higher G/C.
- Work is underway to improve the performance
- for low G/C sequences.
-
- This part of the system is specifically designed to locate
- regions of DNA sequence with protein encoding potential. The
- system has been trained to recognize coding regions in Human DNA
- but seems to work well on DNA sequences from other mammals.
- Because the system has not been tested extensively on species
- other than human, no claims are made for the predictions of
- coding potential on DNA from other species.
-
- To use GRAIL you must first register and get a user ID.
- To become a registered user please send the following
- e-mail message to:
-
- grail@ornl.gov
-
- Register
- Your Name
- Your address
- Your phone number
- Your e-mail address
-
- To have sequences analyzed send e-mail to:
-
- grail@ornl.gov
-
- The message will start with the word "sequences" followed by the
- number of sequences you are sending followed by your user ID
- followed by the sequences you wish to have analyzed in the
- following format:
-
- Sequences number_of_sequences your_user_ID
- >seq1name
- AAAAAAAA........
-
- >seq2name
- TTTTTTTT..........
-
- etc.
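The submission layout above can be sketched in Python. This is only an illustration, not part of GRAIL: the function name, the 60-column line wrapping, and the reading of "100kb" as 100,000 bases are all assumptions. The length check uses the 100-base minimum and 100 kb maximum that this help text gives.

```python
def build_grail_message(user_id, sequences):
    """Format a GRAIL e-mail body from a {name: sequence} mapping.

    Layout per the help text: a "Sequences" line carrying the number of
    sequences and the user ID, then a ">name" line and the bases for each
    sequence. The 60-column wrapping is an assumption, not a GRAIL rule.
    """
    lines = [f"Sequences {len(sequences)} {user_id}"]
    for name, seq in sequences.items():
        seq = "".join(seq.split())  # drop whitespace and line breaks
        # Assumption: "100kb" is read as 100,000 bases.
        if not 100 <= len(seq) <= 100_000:
            raise ValueError(f"{name}: {len(seq)} bases is outside 100 to 100,000")
        lines.append(f">{name}")
        lines.extend(seq[i:i + 60] for i in range(0, len(seq), 60))
        lines.append("")  # blank line between records, as in the layout above
    return "\n".join(lines)
```

The returned string would be pasted into (or piped to) a mail addressed to grail@ornl.gov; sending the mail is left to your own mail program.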
-
- For the system to return any interpretation the sequence to be
- analyzed must be at least 100 bases long (and not more than
- 100kb). For each sequence the following information will be
- returned:
- 1. The score for the coding potential for each position analyzed
- on each strand (the f-(forward) strand represents the sequence as
- received, and the r-(reverse) strand is the reverse complement).
- These scores range from 0.0 to 1.0 and a score greater than 0.5
- identifies a region with protein encoding potential. Non-coding
- regions often have a score of 0.000. To reduce the output, only
- regions with scores of at least 0.01 are reported.
- 2. frame. In calculating the coding potential, the system
- calculates the reading-frame which is "preferred" in the window
- over which the calculation is done and this information is
- returned for regions with scores over 0.5.
- 3. orf. The limits between which the preferred frame is open are
- returned for windows with scores over 0.5.
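As a rough illustration of how the 0.5 cutoff on the raw scores might be applied, the sketch below groups consecutive above-threshold positions into candidate regions. The input structure (a list of (position, score) pairs for one strand) is hypothetical; the actual layout of GRAIL's reply is not shown in this help text, and GRAIL's own interpretation step is more involved than this.

```python
def coding_regions(scores, threshold=0.5):
    """Group consecutive above-threshold entries into (start, end, peak) runs.

    `scores` is a hypothetical list of (position, score) pairs for one
    strand, sorted by position. A run ends at the first entry whose score
    is not greater than the threshold.
    """
    regions = []
    current = None  # [start, end, peak] of the run being built
    for position, score in scores:
        if score > threshold:
            if current is None:
                current = [position, position, score]
            else:
                current[1] = position
                current[2] = max(current[2], score)
        elif current is not None:
            regions.append(tuple(current))
            current = None
    if current is not None:
        regions.append(tuple(current))
    return regions
```

For example, `coding_regions([(100, 0.02), (110, 0.61), (120, 0.93), (130, 0.40)])` yields one region running from 110 to 120 with a peak score of 0.93.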
-
- The second part of the output is the system's interpretation of
- the raw data. This output gives the limits (in general a minimum)
- of the extent of the coding exon, the most likely strand for the
- exon with a probability for the correctness of the strand
- assignment, the preferred reading frame for the exon and a
- quality assessment. An interesting phenomenon we have noted
- is that some exons seem to have coding character on both strands
- or even more coding character on the wrong strand. Be aware that strand
- assignments are not always correct, and it is sometimes useful to
- consider both strands as possible. Any exon with a quality score of
- "excellent" is worth further consideration. Please remember that
- the system is designed to find coding exons of 100 or more bases,
- so small coding exons may well be missed.
-
- This implementation of the CRM has been tested on a set of human
- genes containing 102kb of sequence. This set contained 70 coding
- exons and the system identified 62 (89%) and assigned them all to
- the correct strand (though in a larger test set strand assignment
- was 90-95% correct). The preferred reading frame assignment was
- correct for 60 (96%) of these exons while the frame assignment
- for the other two had some ambiguity. Of the eight missed, 6 were
- less than 100 bases long. Of 43 predicted exons with a quality
- score of "excellent" all were actual coding exons. Of predicted
- exons scoring "good", 11 of 16 (69%) were real, and of 49
- predicted exons with a score of "marginal" only 8 (16%) were
- "real". Though this is a rather limited test set, the results
- of this analysis give some guidance for interpreting CRM output.
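The test-set figures above can be tabulated to show how much weight each quality score deserves. The snippet below only repeats the arithmetic on the numbers quoted in this help text; the variable and function names are illustrative.

```python
# (real exons, predicted exons) per quality class, from the test set above
results = {"excellent": (43, 43), "good": (11, 16), "marginal": (8, 49)}


def fraction_real(real, predicted):
    """Fraction of predicted exons in a class that were actual coding exons."""
    return real / predicted


for quality, (real, predicted) in results.items():
    share = fraction_real(real, predicted)
    print(f"{quality:9s} {real:2d} of {predicted:2d} real ({share:.0%})")

# The per-class counts of real exons add up to the 62 exons identified.
total_real = sum(real for real, _ in results.values())
total_predicted = sum(predicted for _, predicted in results.values())
```

The three classes together account for all 62 identified exons, so a "marginal" call is best treated as a hint rather than a prediction.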
-
- N.B. This is an alpha+ version so we are open to feedback.
- We have a new e-mail address called GRAILMAIL@ORNL.GOV
- for user feedback to the GRAIL staff. Or communication can be
- addressed specifically to us:
-
- Direct questions to: Richard J. Mural, e-mail:
- m9l@stc10.ctd.ornl.gov
- Phone: 615-576-2938
-
- or
-
- Edward C. Uberbacher, e-mail:
- uber@msr.epm.ornl.gov
- Phone: 615-574-6134
-
- or
-
- GRAIL staff, e-mail:
- grailmail@ornl.gov
-
- To receive a copy of this help file send the message "help" to
- grail@ornl.gov.
-
-
- -------------------------------------------------------------
- Appendix A: GRAIL updates
- -------------------------------------------------------------
- Modifications to the GRAIL rule base for constructing the exon
- table from the coding probability information have been made as
- of Feb. 19, 1992. These changes have been designed to recognize
- situations where a single real exon, usually with significant
- extent, is recognized by GRAIL as multiple peaks or multiple exons.
- These additional rules interconnect predicted peaks under
- circumstances where consecutive predicted regions have the same
- preferred reading frame, the frame is open between them, and they
- are relatively close together. The result is generally a beneficial
- simplification of the exon table and a more accurate representation
- of exon structure. This also better adapts GRAIL for use with cDNAs.
- Feedback or questions can be addressed to GRAILMAIL@ornl.gov.
-
- The GRAIL staff
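The merging rule described in Appendix A (same preferred frame, frame open between the regions, regions relatively close together) can be sketched as follows. This is not GRAIL's rule base: the 60-base gap limit, the 0-based half-open coordinates, and the frame convention (frame f means codons start at positions p with p % 3 == f) are all assumptions made for the illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def frame_open(seq, start, end, frame):
    """True if no in-frame stop codon lies wholly inside seq[start:end]."""
    first = start + (frame - start) % 3  # first codon start >= start in this frame
    return all(seq[i:i + 3] not in STOPS for i in range(first, end - 2, 3))


def merge_regions(seq, regions, max_gap=60):
    """Merge consecutive (start, end, frame) regions that satisfy all three
    conditions: same frame, gap of at most max_gap bases (an assumed value),
    and no in-frame stop codon in the gap."""
    merged = []
    for start, end, frame in sorted(regions):
        if merged:
            p_start, p_end, p_frame = merged[-1]
            if (frame == p_frame
                    and start - p_end <= max_gap
                    and frame_open(seq, p_end, start, frame)):
                merged[-1] = (p_start, max(p_end, end), frame)
                continue
        merged.append((start, end, frame))
    return merged
```

Two peaks in frame 0 separated by a short stop-free gap come back as a single region; an in-frame stop codon in the gap, or a change of frame, keeps them separate.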
-
- ===========================================================================
- /////////////////////////////////////////////////////////////////////////
- ===========================================================================
-
-
- >------------------------------ GENE-ID OUTPUT -------------------------------<
-
-
- GENEID UPDATES
-
- 1. The top ranking gene model is now automatically compared to protein
- databases using the BLAST Network Service provided by the National
- Center for Biotechnology Information. The results will be mailed to
- you separately and might give you some clues as to the function of
- your gene.
-
- 2. NETGENE is now available on this server. Just include the keyword line
- "NetGene" between the keyword line "Genomic Sequence" and your
- sequence. More information is available in the info file which can be
- obtained by including the keyword line "geneid info".
-
- 3. GENEID was originally developed to predict the exon structure of
- full-length pre-mRNA. If the sequence does not contain first or last
- exons, then GENEID will still try to predict first and last exons,
- although they will tend to be short (<15 bp) and have low scores
- (<0.5). The lack of first or last exons may also affect the prediction
- of internal exons (see items 5-7 of the output). A future version
- will allow scanning for internal exons in small gene fragments.
-
- 4. If you have success in confirming GENEID predictions, we would like to
- hear about it. Send an email to steen@darwin.bu.edu.
-
- -------------------------------------------------------------------------------
-
-
-
-
- GENEID AND NETGENE ONLINE SYSTEMS FOR PREDICTION OF GENE STRUCTURE
- version 1.0 2/1/1992
-
- GENEID
- _______________________________________________________________________________
- Geneid is an Artificial Intelligence system for analyzing vertebrate genomic
- DNA and prediction of exons and gene structure (1). A prototype is implemented
- as a fast, automatic email-response system. Users have the option of having
- their DNA sequence analyzed by NetGene (2) simultaneously.
-
- REGISTRATION:
- Before or simultaneously with submitting a sequence for analysis, you need to
- register your name by sending a line with the word "register", followed by
- your name and address. Example:
-
- register, Don Johnson, Miami Vice, Bayview Marina Dock A12, Miami, FL 34566-
- 1234, U.S.A.
-
- NOTE>> The line can be longer than 80 characters as long as it contains NO
- linebreaks, (that is, do NOT press the <Return> key until the end of the
- address.)
-
- Send the line in a mail to: geneid@darwin.bu.edu. The registration
- information will only be used for maintaining a file of the number and
- geographic distribution of the users.
-
- SUBMITTING SEQUENCES:
- Submit your sequence in the following format (approximately the same format
- as used for fasta, BLAST and GRAIL). You can submit only one sequence per
- mail. Put the sequence after the keyword "Genomic Sequence" as shown below:
-
- Genomic Sequence
-
- >seqname
- TTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCC
- CGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTG
- GCCAACTCCATCACTA...................
-
- (Restrict the line length to 80 characters. The seqname is limited to 20
- characters).
-
- NOTE>> IF YOUR MAIL DOES NOT CONTAIN THE KEYWORD "GENOMIC SEQUENCE", OR
- ANY OTHER KEYWORDS LISTED IN THIS FILE, NO MAIL WILL BE RETURNED TO YOU.
-
- If the reply file with the results exceeds the mail limit of 300
- kB, the reply will be split into several files. On a UNIX system you
- can send the file containing the sequence as follows: mail -v
- geneid@darwin.bu.edu <File
-
-
- LIMITS:
- GeneId currently will not accept sequences smaller than 100 bp or larger
- than 20 kb.
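A submission that respects the format and limits above could be assembled along these lines. The function is a sketch, not something supplied with GeneID: its name and the default 60-column wrapping are assumptions, and "20 kb" is read as 20,000 bases.

```python
def build_geneid_message(name, sequence, width=60):
    """Build a GeneID mail body: keyword line, blank line, >name, wrapped bases.

    Checks the limits given in the help text: seqname of at most 20
    characters, lines of at most 80 characters, and a sequence of 100 bp
    to 20 kb (taken here as 20,000 bases, which is an assumption).
    """
    seq = "".join(sequence.split())  # drop whitespace and line breaks
    if not name or len(name) > 20:
        raise ValueError("seqname must be 1 to 20 characters")
    if not 1 <= width <= 80:
        raise ValueError("line length must not exceed 80 characters")
    if not 100 <= len(seq) <= 20_000:
        raise ValueError(f"{len(seq)} bp is outside the 100 bp to 20 kb range")
    body = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(["Genomic Sequence", "", f">{name}", *body])
```

Writing the result to a file gives you something that can be sent with the `mail` command shown above.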
-
- CONFIDENTIALITY:
- Your submitted sequence will be deleted automatically immediately after
- reception by GeneID.
-
-
- ANALYSIS:
- GeneID will scan your sequence for potential splice sites, start codons, and
- stop codons. Then it will try to assemble these into potential first exons,
- internal exons, and last exons. Exons will be evaluated according to a number
- of characteristics related to coding and splicing, and only likely exons will
- be kept. Mutually exchangeable exons (normally overlapping and in the same
- frame) will be put together in classes. Only the top 15 ranking first and
- last exon classes, and the top 35 ranking internal exon classes
- from each sequence will be kept and assembled into potential gene models with
- an open reading frame, which will be ranked according to the quality of the
- exons they contain. The top 20 models will be included in the return mail. Your
- return mail will also contain lists of the sites and exons created during the
- analysis. GeneID will not analyze the reverse complement of your sequence. If
- you suspect a gene on the other strand, submit the reverse complement sequence
- separately.
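Since the reverse complement has to be prepared by the submitter, a minimal helper is sketched below. It handles only the four unambiguous bases; any other symbol (for example N) is passed through unchanged, which is correct for N but not for ambiguity codes such as R or Y. The function name is illustrative.

```python
# Complement table for the four unambiguous bases, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string.

    Symbols other than A, C, G and T are left as they are.
    """
    return seq.translate(_COMPLEMENT)[::-1]
```

Applying the function twice returns the original sequence, which is a quick sanity check before submitting the second strand.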
-
- TIPS FOR USE OF GENEID:
- GeneID will try to identify first, internal, and last exons in each of the
- sequences you submit, and try to assemble these into models of ONE likely
- gene in each sequence. To avoid missing any exons, the number of exons will
- be vastly overpredicted, and only a few of them are likely to be true (they
- tend to be the top ranking exons, but a few true exons rank very low). But
- these few true exons are likely to be found in the gene models because they
- fit together to form a continuous open reading frame. Thus you should look to
- the gene models to find a probable coding region.
- If you submit a sequence that turns out to contain two genes, the behavior of
- GeneID is unpredictable. It could either predict one large gene containing
- both, or it could predict only the gene with the most typical characteristics.
- If you submit a sequence that contains only part of a gene, GeneID will try to
- identify an entire gene in this sequence. Thus the predicted first exon may
- actually be part of a true internal exon, or the predicted last exon may be
- part of a true internal exon. If GeneID fails to predict any genes, you might
- look at the potential exon lists.
- Thus you can experiment with input and response, by starting out with sequences
- that are not too long (for example less than 10 kb), and see if GeneID is
- able to extend the gene if you extend the sequence. If you have very large
- sequences, it may be a good idea to request analysis by NetGene first (see
- below). NetGene will analyze sequences up to 100 kb, and may find regions
- containing exons of very high likelihood. These regions can then be resubmitted
- to GeneID for further analysis.
- GeneID will not construct models with more than 22 exons.
- If the sequence contains frameshift errors in exons, then that may affect the
- quality of the prediction in the current implementation.
-
- ACCURACY:
- In a test on 28 genes from GenBank, 91% of the nucleotides were correctly
- predicted as coding or non-coding. Since these two categories are unequally
- represented, a better measure of accuracy may be the correlation coefficient,
- which was found to be 0.68. See paper for details.
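To see why the correlation coefficient can be the better measure when coding and non-coding nucleotides are unequally represented, consider the sketch below. It assumes the coefficient is the usual correlation between binary predicted and actual labels (the phi, or Matthews, coefficient); the help text does not define it, so see the paper for the exact measure. The counts in the usage note are invented for illustration and are not GeneID results.

```python
from math import sqrt


def accuracy_and_cc(tp, tn, fp, fn):
    """Per-nucleotide accuracy and correlation coefficient.

    tp/fn: coding nucleotides predicted coding / non-coding
    tn/fp: non-coding nucleotides predicted non-coding / coding
    The coefficient is defined as 0 when any marginal total is zero.
    """
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, cc
```

With a made-up sequence of 10,000 nucleotides of which 1,000 are coding, a predictor that calls everything non-coding (`accuracy_and_cc(0, 9000, 0, 1000)`) is 90% accurate yet has a correlation of 0, which is why the accuracy figure alone says little.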
-
- ANALYSIS TIME:
- Will depend on the load on the system and grows approximately linearly with
- the length of the sequence input. Expect at least 1 minute per kb. Longer
- response times can occur if the system is temporarily down (check with the
- UNIX command: "finger geneid@darwin.bu.edu").
-
- FURTHER INFORMATION:
- A preprint of a paper describing the development and testing of GeneID is
- available as a Stuffit.hqx file for Macintosh. Simply include the line:
-
- Preprint Request
-
- in your mail to geneid@darwin.bu.edu, and the manuscript will be mailed to you.
-
-
- REFERENCING:
- Publication of output from GeneID must be referenced as follows:
- (1) Guigo, R., Knudsen, S., Drake, N., and Smith, T. (1992) Prediction of Gene
- Structure. Journal of Molecular Biology. 226:141-157.
-
-
- PROBLEMS, COMMENTS, AND SUGGESTIONS:
- Can be mailed to steen@darwin.bu.edu.
-
- Users of the MBCRR and BMERC national computer resources have direct
- online access to GeneID from their account. Contact Tom Graf at
- tom@mbcrr.harvard.edu for information on these accounts.
-
-
-
- NETGENE
- ________________________________________________________________________________
- Users now have the option of having their submitted sequence analyzed by NetGene
- also. NetGene predicts splice sites and gives information about the likelihood
- of the prediction. NetGene detects both coding regions and splice signals, and
- combines that information to predict both small and large exons (it predicts one
- end of the exon, the acceptor or donor site).
-
- Simply include the keyword "NetGene" between the keyword "Genomic Sequence"
- and your sequence. The results of the NetGene analysis will be mailed to you
- separately. The only difference in sequence format is that NetGene will accept
- sequences UP TO 100 kb. Thus, NetGene can be used in conjunction with GeneID
- by first submitting a large sequence to NetGene (specify the keyword "NetGene";
- GeneID will not respond if the sequence is larger than 20 kb). Regions that
- show exons with very high likelihood can then be resubmitted to GeneID (<20kb)
- for further analysis. The minimum sequence length that NetGene will faithfully
- analyze is 451 bp.
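The two-step workflow above (NetGene on the large sequence, then GeneID on a promising region) needs a piece of at most 20 kb cut out around the region of interest. A sketch of that cut follows. The region coordinates are supplied by you after reading the NetGene reply; they are 0-based and half-open here, which is a convention chosen for the example, and "20 kb" is taken as 20,000 bases.

```python
def window_for_geneid(sequence, region_start, region_end, max_len=20_000):
    """Return (offset, subsequence) of at most max_len bases that contains
    sequence[region_start:region_end], with the region roughly centred.

    `offset` is the position of the subsequence in the original sequence,
    so coordinates in the GeneID reply can be mapped back by adding it.
    """
    region_len = region_end - region_start
    if region_len <= 0 or region_len > max_len:
        raise ValueError("region must be non-empty and fit within max_len")
    flank = (max_len - region_len) // 2
    start = max(0, region_start - flank)
    end = min(len(sequence), start + max_len)
    start = max(0, end - max_len)  # use the full length near the 3' end
    return start, sequence[start:end]
```

Keep in mind that GeneID will try to find a complete gene in whatever piece it receives, so a window that cuts through a gene may yield first or last exons that are really internal ones.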
-
- REFERENCING AND FURTHER INFORMATION
- Publication of output from NetGene must be referenced as follows:
- (2) Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of Human mRNA
- Donor and Acceptor Sites from the DNA Sequence. Journal of Molecular Biology
- 220:49-65.
-
- PROBLEMS, COMMENTS AND SUGGESTIONS:
- Can be mailed to : steen@darwin.bu.edu
-
-
-
- ========================================================================
- ///////////////////////////////////////////////////////////////////////
- ========================================================================
-
-
-
-
-
- GENMARK : SYSTEM FOR PREDICTING PROTEIN CODING REGIONS
- Version 1.1 4/15/92
- (Internet Electronic Mail Server)
-
-
- GENERAL INFORMATION
-
- GenMark is a software package available from the Georgia Tech
- School of Applied Biology & Office of Information Technology
- for the quick analysis of newly sequenced DNA.
-
- GenMark 1.1 is based on a special type of Markov chain model of coding
- and noncoding nucleotide sequences. It proves to be quite a sensitive indicator
- of protein coding regions in E. coli and closely related species. For the
- analysis of a 96 bp segment, the rate of false positive predictions is about
- 10% and the rate of false negatives about 22.5%. The process for training the
- program for other species is fairly straightforward, and new species will be
- added later, based on demand and available information.
-
- GenMark is robust to the presence of ambiguities in newly sequenced DNA -
- up to 10% of the sample DNA may be indicated by ambiguity symbols.
-
- GenMark receives its submissions from your local electronic mail service
- and will reply with a list of open reading frames that it recognizes as protein
- coding regions. There are also various other options, such as a PostScript(tm)
- graph of the results, which may optionally be requested. GenMark should reply
- within an hour of a sequence's submission by way of electronic mail.
-
-
- SUBMISSION OF SEQUENCES FOR ANALYSIS
-
- Nucleotide sequences destined for processing should be sent via E-mail to:
-
- genmark@ford.gatech.edu
-
- The subject line of this message must contain one of three keywords:
-
- instructions
- registration
- genmark
-
- If the subject of the message is "instructions", GenMark will reply with
- the most current submission instructions and news available on the system.
-
- If the subject of the message is "registration", your message will be
- logged in a registration roster. It is NOT necessary to register in order to
- use GenMark. If you decide to register, we ask that you include your name, your
- E-mail address, and a brief list of the organisms which you would like to see
- supported in future versions of GenMark (the family Enterobacteriaceae should
- be fairly well represented by the E. coli information).
-
- We will keep those persons who register informed with further developments
- in the software and its options.
-
- If the subject of the message is "genmark", the program will try to
- analyze the contents of the message as sequence information. The message
- should minimally have the word "data" on a line by itself, followed by the
- sequence information (see below for a discussion on how to supply options and
- some example submissions).
-
-
- SUPPLYING OPTIONS TO GENMARK
-
- No options are required for GenMark to function. The options specified
- below just change the manner in which the program works. Only one option is
- permissible per line. All of the options must occur before the keyword "data"
- and the sequence information. ALL OF THE KEYWORDS MUST BE ENTERED IN LOWERCASE
- LETTERS; the case of the sequence itself doesn't matter.
-
- The options:
-
- # A comment. The rest of the line, after this symbol, is
- utterly ignored.
-
- address Alternative E-mail address. After this option, include a
- valid E-mail address to which the program should send the
- output (if it is different from the address from which
- the submission was sent).
-
- name The name of the person who submitted the sequence. This is
- particularly important for sites where several people will
- be submitting sequences from the exact same E-mail address.
- After this option, include the name.
-
- order The Markov chain order to use. If you don't know what this
- is, don't mess with it. Higher is better, up to a point. The
- default is 4, though orders 1 through 5 are now available.
- After this option, include the new order.
-
- psgraph Give PostScript(tm) output. This instructs the program to
- include a PostScript graph of the results which can be
- printed on any PostScript compatible printer. The page is
- divided into six horizontal panels with the probability
- function on the y-axis, and the nucleotide position along
- the x-axis. The six panels represent the six different frames,
- panels 1-3 indicate frames 1-3 on the direct strand, and
- panels 4-6 indicate frames 1-3 on complementary strand. Open
- reading frame indicators appear along the middle of each
- graph. Since there's a limit to the size of E-Mail messages,
- expect the PostScript output to be sent as several messages.
-
- step Set the window step. This must be stated as a multiple of
- 3 nucleotides. The default is 12. The practical upshot of
- this setting is that it allows you some freedom in adjusting
- the resolution of the PostScript(tm) graph. For instance, a
- step setting of 3 gives 4 times the resolution of the default
- of 12.
-
- threshold Set the open reading frame threshold. This is a number
- between 0 and 1 (or between 0 and 100, as a percentage)
- giving the minimum value of the probability function
- that an open reading frame must have to be accepted as a
- protein coding region. The default is 0.50.
-
- title The title you want to give to your PostScript(tm) graph.
-
- window The size of the analysis window (if you don't know what
- this is, don't play with it). The default is 96 nucleotides
- and generally 96 to 144 nucleotides works best.
-
-
- SAMPLE SUBMISSIONS TO GENMARK
-
- SAMPLE 1
-
- > mail genmark@ford.gatech.edu
- Subject: genmark
- # This example shows a minimal submission, just using the defaults set by
- # the program.
- #
- # NOTE: this will reply automatically to the exact address that it was sent
- # from with only a list of open reading frames.
- #
- # The actual DNA sequence may contain any standard DNA ambiguity symbols.
- # Anything that isn't a letter (like numbers, punctuation, spaces, carriage
- # returns) will just be ignored.
- data
- TCSSATGCATGHCATCGATWWCTCAGTCAGNA...
-
-
- SAMPLE 2
-
- > mail genmark@ford.gatech.edu
- Subject: genmark
- # This is an example of using all of the different options.
- address biologist@college.edu
- name John Doe
- order 5
- psgraph
- step 6
- threshold 0.50
- title John Doe's New Protein Coding Region
- window 144
- data
- TCAGTTCCAAGGTTTCCCAAAGGGTTTTCCCCAAAAGGGG...
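The two samples above can also be generated programmatically. The sketch below builds the message body only (the subject line must still be "genmark") and checks the constraints stated in this help file: lowercase keywords, one option per line, options before "data", orders 1 through 5, a step that is a multiple of 3 and no larger than the window. The function name and its keyword-argument interface are assumptions, not part of GenMark.

```python
ALLOWED = ("address", "name", "order", "psgraph", "step",
           "threshold", "title", "window")


def build_genmark_body(sequence, **options):
    """Return a GenMark mail body: one option per line, then "data", then
    the sequence. Defaults (order 4, step 12, window 96, threshold 0.50)
    are used only for validation and are not written to the message."""
    unknown = set(options) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unknown option(s): {sorted(unknown)}")
    order = options.get("order", 4)
    step = options.get("step", 12)
    window = options.get("window", 96)
    threshold = options.get("threshold", 0.5)
    if order not in range(1, 6):
        raise ValueError("order must be 1 through 5")
    if step <= 0 or step % 3:
        raise ValueError("step must be a positive multiple of 3")
    if step > window:
        raise ValueError("step must not be larger than the window")
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must lie between 0 and 100")
    lines = []
    for key in ALLOWED:
        if key not in options:
            continue
        if key == "psgraph":
            if options[key]:
                lines.append("psgraph")  # flag option, takes no value
        else:
            lines.append(f"{key} {options[key]}")
    lines.extend(["data", sequence])
    return "\n".join(lines)
```

Calling `build_genmark_body(seq)` with no options reproduces the minimal form of SAMPLE 1 without the comment lines.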
-
-
- THINGS TO WATCH OUT FOR
-
- The sendmail program used for transferring messages across the network
- is limited to messages that are 64000 characters long. Therefore, it is good
- to remember to send any information you might have in chunks smaller than the
- 64000 character limit.
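One way to stay under the 64000-character limit is to cut a long sequence into pieces before mailing. The sketch below is an assumption-laden illustration: the 2,000 characters reserved for mail headers and option lines is a guess, the optional overlap between pieces is not something this help file asks for, and each piece would be analyzed as an independent sequence whose coordinates restart at 1.

```python
def split_for_mail(sequence, limit=64_000, reserve=2_000, overlap=0):
    """Split a sequence into (offset, piece) pairs that each fit in one mail.

    `reserve` characters are set aside for headers and option lines (an
    assumed margin). With overlap > 0, neighbouring pieces share that many
    bases so that features at a cut point are not lost.
    """
    size = limit - reserve
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("reserve and overlap must leave room for sequence")
    pieces = []
    start = 0
    while start < len(sequence):
        pieces.append((start, sequence[start:start + size]))
        if start + size >= len(sequence):
            break  # the last piece reached the end of the sequence
        start += size - overlap
    return pieces
```

The offset stored with each piece lets you map positions in a reply back onto the original sequence.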
-
- The PostScript(tm) output might take up more space than is permissible
- in a mail message, so GenMark will send the graphic in parts that are smaller
- than 64K in length.
-
- If you shrink the step down to 3 and send a good sized sequence, the
- PostScript(tm) output will be huge, so don't be surprised. Try to reserve doing
- that for smaller sequences. For short sequences, you'll want to make the step
- smaller. We suggest a step of 6 for any sequence under about 1.5kb long, and
- a step of 3 for sequences less than about 800 bases long.
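The step suggestions above, and the effect of the step on graph resolution, can be put into a small helper. The cutoffs come from this help file; the window-count formula assumes the program scores one window at every step position from the start of the sequence, which is a guess about GenMark's internals made only to illustrate the size of the output.

```python
def suggested_step(length):
    """Step suggested by the help file for a sequence of `length` bases."""
    if length < 800:
        return 3
    if length < 1500:
        return 6
    return 12  # the program default


def window_count(length, window=96, step=12):
    """Number of window positions, assuming windows start at 0, step, 2*step...
    and must fit entirely within the sequence (an assumed scheme)."""
    if step <= 0 or step % 3:
        raise ValueError("step must be a positive multiple of 3")
    if length < window:
        return 0
    return (length - window) // step + 1
```

Under that assumption, a 1,500-base sequence gives 118 points per frame at the default step of 12 and 469 at a step of 3, roughly the fourfold gain in resolution mentioned for the `step` option.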
-
- Don't ask the program to make the step larger than the window. It won't
- crash the program, but then again you'll probably just get garbage back.
-
- The sequences you send are deleted as soon as they have been processed
- by the program. We cannot recover them for you. If you do not receive a
- response in a couple of hours, something's wrong. Verify the format of your
- submission and resend it.
-
- The graphic response may be effective for analyzing the intron/exon
- structure of eukaryotic sequences, but there are no guarantees. In such a
- case, the list of open reading frames would almost certainly be useless; only
- the graphic would make any sense.
-
- In many cases, the graphic output can tell you much more information
- about the sequence in question than the open reading frame listing alone.
- Careful evaluation of the graphic could yield clues as to sequencing errors
- and frameshifts.
-
-
- REFERENCES
-
- Should you refer to the results of GENMARK analysis you should use
- the following reference:
-
- Borodovsky M. (1990) Recognition of coding regions in nucleotide sequences.
- In M.F.Frank-Kamenetskii ed. Computer analysis of Genetic Texts, Nauka,
- Moscow.
-
- Borodovsky M., McIninch J. Prediction of Gene Locations Using DNA Markov Chain
- Models (Submitted to CABIOS).
-
-
- QUESTIONS, PROBLEMS, SUGGESTIONS
-
- Please send any comments or questions that you might have about the software
- or the method of coding region recognition to:
-
- mb56@hydra.gatech.edu (Mark Borodovsky)
-
- or
-
- gt1619a@hydra.gatech.edu (James McIninch)
-
-
-