NetNews Usenet Archive 1992 #31

Text File | 1992-12-30 | 25.6 KB | 611 lines

Newsgroups: bionet.software
Path: sparky!uunet!cis.ohio-state.edu!magnus.acs.ohio-state.edu!usenet.ins.cwru.edu!gatech!udel!darwin.sura.net!welchgate.welch.jhu.edu!danj
From: danj@welchgate.welch.jhu.edu (Dan Jacobson)
Subject: Re: intron/exon borders
Message-ID: <1992Dec30.220952.25847@welchgate.welch.jhu.edu>
Organization: Johns Hopkins Univ. Welch Medical Library
References: <30DEC199214434286@aardvark.ucs.uoknor.edu>
Date: Wed, 30 Dec 1992 22:09:52 GMT
Lines: 600

In article <30DEC199214434286@aardvark.ucs.uoknor.edu> bfrank@aardvark.ucs.uoknor.edu (FRANK,BART) writes:
>Can anyone recommend a good program to screen human genomic sequences
>and predict positions of intron/exon borders?
>
>Thanks,
>Bart Frank
>Internet: BFRANK@AARDVARK.UCS.UOKNOR.EDU

There are three mail servers which do this type of thing, namely GRAIL,
GENEID, and GENMARK. I am including information about these servers
below.

Happy Holidays,

Dan Jacobson

danj@welchgate.welch.jhu.edu

=========================================================================
////////////////////////////////////////////////////////////////////////
=========================================================================

Welcome to GRAIL (Gene Recognition and Analysis Internet Link)

Grail is an interface to a system which will ultimately provide
automated gene assembly from DNA sequence data. Currently the system
provides analysis of the protein coding potential of a DNA sequence.
The coding recognition module (CRM) uses a multiple-sensor neural
network approach to identify coding exons that are at least 100 bases
long. In its current configuration the CRM identifies 90% of such
regions with less than 1 false positive coding exon per 5 coding exons
indicated. Your success rate will depend on a number of parameters,
including the G/C content of your sequence. In general, coding regions
in sequences of low G/C content are not as well recognized as those in
higher G/C.
Investigation is underway to improve the performance for low G/C
sequences.

This part of the system is specifically designed to locate regions of
DNA sequence with protein encoding potential. The system has been
trained to recognize coding regions in human DNA but seems to work well
on DNA sequences from other mammals. Because the system has not been
tested extensively on species other than human, no claims are made for
the predictions of coding potential on DNAs from other species.

To use GRAIL you must first register and get a user ID.

To become a registered user please send the following e-mail message
to: grail@ornl.gov

    Register
    Your Name
    Your address
    Your phone number
    Your E-mail address

To have sequences analyzed send e-mail to: grail@ornl.gov

The message will start with the word "sequences" followed by the number
of sequences you are sending, followed by your user ID, followed by the
sequences you wish to have analyzed in the following format:

    Sequences number_of_sequences your_user_ID
    >seq1name
    AAAAAAAA........
    >seq2name
    TTTTTTTT..........
    etc.

For the system to return any interpretation, the sequence to be
analyzed must be at least 100 bases long (and not more than 100 kb).

For each sequence the following information will be returned:

1. The score for the coding potential for each position analyzed on
   each strand (the f-(forward) strand represents the sequence as
   received, and the r-(reverse) strand is the reverse complement).
   These scores range from 0.0 to 1.0, and a score greater than 0.5
   identifies a region with protein encoding potential. Non-coding
   regions often have a score of 0.000. To reduce the output, only
   regions with scores of at least 0.01 are reported.

2. frame. In calculating the coding potential, the system calculates
   the reading frame which is "preferred" in the window over which the
   calculation is done, and this information is returned for regions
   with scores over 0.5.

3. orf.
   The limits between which the preferred frame is open are returned
   for windows with scores over 0.5.

The second part of the output is the system's interpretation of the raw
data. This output gives the limits (in general a minimum) of the extent
of the coding exon, the most likely strand for the exon with a
probability for the correctness of the strand assignment, the preferred
reading frame for the exon, and a quality assessment. An interesting
phenomenon we have noted is that some exons seem to have coding
character on both strands, or even more coding character on the wrong
strand. Be aware that strand assignments are not always correct, and it
is sometimes useful to consider both strands as possible. Any exon with
a quality score of "excellent" is worth further consideration. Please
remember that the system is designed to find coding exons of 100 or
more bases, so small coding exons may well be missed.

This implementation of the CRM has been tested on a set of human genes
containing 102 kb of sequence. This set contained 70 coding exons, and
the system identified 62 (89%) and assigned them all to the correct
strand. (In a larger test set, though, strand assignment was 90-95%
correct.) The preferred reading frame assignment was correct for 60
(96%) of these exons, while the frame assignment for the other two had
some ambiguity. Of the eight missed, 6 were less than 100 bases long.
Of 43 predicted exons with a quality score of "excellent", all were
actual coding exons. Of predicted exons scoring "good", 11 of 16 (69%)
were expected, and of 49 predicted exons with a score of "marginal",
only 8 (16%) were "real". Though this is a rather limited test set, the
results of this analysis give some guidance for interpreting CRM
output.

N.B. This is an alpha+ version, so we are open to feedback. We have a
new e-mail address called GRAILMAIL@ORNL.GOV for user feedback to the
GRAIL staff. Or communication can be addressed specifically to us:

Direct questions to:

Richard J.
Mural, e-mail: m9l@stc10.ctd.ornl.gov  Phone: 615-576-2938
or
Edward C. Uberbacher, e-mail: uber@msr.epm.ornl.gov  Phone: 615-574-6134
or
GRAIL staff, e-mail: grailmail@ornl.gov

To receive a copy of this help file send the message "help" to
grail@ornl.gov.

-------------------------------------------------------------
Appendix A: GRAIL updates
-------------------------------------------------------------

Modifications to the GRAIL rule base for constructing the exon table
from the coding probability information have been made as of Feb. 19,
1992. These changes have been designed to recognize situations where a
single real exon, usually with significant extent, is recognized by
GRAIL as multiple peaks or multiple exons. These additional rules
interconnect predicted peaks under circumstances where consecutive
predicted regions have the same preferred reading frame, the frame is
open between them, and they are relatively close together. The result
is generally a beneficial simplification of the exon table and a more
accurate representation of exon structure. This also better adapts
GRAIL for use with cDNAs.

Feedback or questions can be addressed to GRAILMAIL@ornl.gov.

The GRAIL staff

===========================================================================
/////////////////////////////////////////////////////////////////////////
===========================================================================

>------------------------------ GENE-ID OUTPUT -------------------------------<

GENEID UPDATES

1. The top ranking gene model is now automatically compared to protein
   databases using the BLAST Network Service provided by the National
   Center for Biotechnology Information. The results will be mailed to
   you separately and might give you some clues as to the function of
   your gene.

2. NETGENE is now available on this server. Just include the keyword
   line "NetGene" between the keyword line "Genomic Sequence" and your
   sequence.
   More information is available in the info file, which can be
   obtained by including the keyword line "geneid info".

3. GENEID was originally developed to predict the exon structure of
   full-length pre-mRNA. If the sequence does not contain first or last
   exons, then GENEID will still try to predict first and last exons,
   although they will tend to be short (<15 bp) and have low scores
   (<0.5). The lack of first or last exons may also affect the
   prediction of internal exons (see items 5-7 of the output). A future
   version will allow scanning for internal exons in small gene
   fragments.

4. If you have success in confirming GENEID predictions, we would like
   to hear about it. Send an email to steen@darwin.bu.edu.

-------------------------------------------------------------------------------

GENEID AND NETGENE
ONLINE SYSTEMS FOR PREDICTION OF GENE STRUCTURE
version 1.0  2/1/1992

GENEID
_______________________________________________________________________________

Geneid is an Artificial Intelligence system for analyzing vertebrate
genomic DNA and predicting exons and gene structure (1). A prototype is
implemented as a fast, automatic email-response system. Users have the
option of having their DNA sequence analyzed by NetGene (2)
simultaneously.

REGISTRATION:
Before or simultaneously with submitting a sequence for analysis, you
need to register your name by sending a line with the word "register",
followed by your name and address. Example:

register, Don Johnson, Miami Vice, Baywiev Marina Dock A12, Miami, FL 34566-1234, U.S.A.

NOTE>> The line can be longer than 80 characters as long as it contains
NO linebreaks (that is, do NOT press the <Return> key until the end of
the address).

Send the line in a mail to: geneid@darwin.bu.edu. The registration
information will only be used for maintaining a file of the number and
geographic distribution of the users.

SUBMITTING SEQUENCES:
Your sequences must be submitted in the following format
(approximately the same format as used for fasta, BLAST and GRAIL).
You can submit only one sequence per mail. Put the sequence after the
keyword "Genomic Sequence" as shown below:

    Genomic Sequence
    >seqname
    TTGGCCACTCCCTCTCTGCGCGCTCGCTCGCTCACTGAGGCCGGGCGACCAAAGGTCGCC
    CGACGCCCGGGCTTTGCCCGGGCGGCCTCAGTGAGCGAGCGAGCGCGCAGAGAGGGAGTG
    GCCAACTCCATCACTA...................

(Restrict the line length to 80 characters. The seqname is limited to
20 characters.)

NOTE>> IF YOUR MAIL DOES NOT CONTAIN THE KEYWORD "GENOMIC SEQUENCE", OR
ANY OTHER KEYWORDS LISTED IN THIS FILE, NO MAIL WILL BE RETURNED TO
YOU.

If the reply file with the results would exceed the Mail limit of
300 kB, the reply will be split into several files.

On a UNIX system you could send the File containing the sequence as
follows:

    mail -v geneid@darwin.bu.edu <File

LIMITS:
GeneId currently will not accept sequences smaller than 100 bp or
larger than 20 kb.

CONFIDENTIALITY:
Your submitted sequence will be deleted automatically immediately after
reception by GeneID.

ANALYSIS:
GeneID will scan your sequence for potential splice sites, start
codons, and stop codons. Then it will try to assemble these into
potential first exons, internal exons, and last exons. Exons will be
evaluated according to a number of characteristics related to coding
and splicing, and only likely exons will be kept. Mutually exchangeable
exons (normally overlapping and in the same frame) will be put together
in classes. Only the top 15 ranking first and last exon classes, and
the top 35 ranking internal exon classes from each sequence, will be
kept and assembled into potential gene models with an open reading
frame, which will be ranked according to the quality of the exons they
contain. The top 20 models will be included in the return mail. Your
return mail will also contain lists of the sites and exons created
during the analysis.
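
The submission format and limits described above can be sketched in a few lines of Python. This is only an illustration of the rules as stated in the help text, not code from the GeneID authors: the function name, the 60-column wrap (any width up to 80 would do), and the reading of "100 kb" as 100,000 bases for the NetGene case are assumptions.

```python
MIN_BP = 100             # GeneID rejects anything shorter (per the help text)
MAX_BP_GENEID = 20_000   # GeneID upper limit, 20 kb
MAX_BP_NETGENE = 100_000 # NetGene upper limit; "100 kb" read as 100,000 (assumption)
MAX_NAME = 20            # seqname length limit
LINE_WIDTH = 60          # assumption: any width up to 80 is allowed


def build_geneid_message(name, sequence, netgene=False):
    """Return the body of a hypothetical GeneID submission mail."""
    # Drop whitespace and line breaks so the length check counts bases only.
    seq = "".join(sequence.split())
    if len(name) > MAX_NAME:
        raise ValueError("seqname is limited to 20 characters")
    max_bp = MAX_BP_NETGENE if netgene else MAX_BP_GENEID
    if not MIN_BP <= len(seq) <= max_bp:
        raise ValueError(f"sequence must be {MIN_BP} to {max_bp} bases long")

    lines = ["Genomic Sequence"]
    if netgene:
        # The NetGene keyword goes between "Genomic Sequence" and the sequence.
        lines.append("NetGene")
    lines.append(">" + name)
    # Wrap the sequence into fixed-width lines.
    lines.extend(seq[i:i + LINE_WIDTH] for i in range(0, len(seq), LINE_WIDTH))
    return "\n".join(lines) + "\n"
```

Only one sequence goes into each mail, so a caller with several sequences would build one message per sequence.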

GeneID will not analyze the reverse complement of your sequence. If you
suspect a gene on the other strand, submit the reverse complement
sequence separately.

TIPS FOR USE OF GENEID:
GeneID will try to identify first, internal, and last exons in each of
the sequences you submit, and try to assemble these into models of ONE
likely gene in each sequence. To avoid missing any exons, the number of
exons will be vastly overpredicted, and only a few of them are likely
to be true (they tend to be the top ranking exons, but a few true exons
rank very low). But these few true exons are likely to be found in the
gene models because they fit together to form a continuous open reading
frame. Thus you should look to the gene models to find a probable
coding region.

If you submit a sequence that turns out to contain two genes, the
behavior of GeneID is unpredictable. It could either predict one large
gene containing both, or it could predict only the gene with the most
typical characteristics.

If you submit a sequence that contains only part of a gene, GeneID will
try to identify an entire gene in this sequence. Thus the predicted
first exon may actually be part of a true internal exon, or the
predicted last exon may be part of a true internal exon.

If GeneID fails to predict any genes, you might look at the potential
exon lists.

Thus you can experiment with input and response by starting out with
sequences that are not too long (for example less than 10 kb), and see
if GeneID is able to extend the gene if you extend the sequence.

If you have very large sequences, it may be a good idea to request
analysis by NetGene first (see below). NetGene will analyze sequences
up to 100 kb, and may find regions containing exons of very high
likelihood. These regions can then be resubmitted to GeneID for further
analysis.

GeneID will not construct models with more than 22 exons.
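
Since GeneID scans only the strand it is given, the opposite strand has to be prepared by the submitter. A minimal sketch of that step, assuming plain A/C/G/T/N input (the helper name is hypothetical, and other IUPAC ambiguity codes are passed through unchanged rather than complemented):

```python
# Translation table for the unambiguous bases plus N, in both cases.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string.

    Characters outside A, C, G, T, N are left as they are (assumption:
    the input does not rely on other ambiguity symbols).
    """
    return seq.translate(_COMPLEMENT)[::-1]


forward = "ATGCCGTA"
reverse = reverse_complement(forward)
print(reverse)  # TACGGCAT
```

The forward and reverse strings would then go out as two separate mails, since only one sequence is accepted per submission.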

If the sequence contains frameshift errors in exons, that may affect
the quality of the prediction in the current implementation.

ACCURACY:
In a test on 28 genes from GenBank, 91% of the nucleotides were
correctly predicted as coding or non-coding. Since these two categories
are unequally represented, a better measure of accuracy may be the
correlation coefficient, which was found to be 0.68. See the paper for
details.

ANALYSIS TIME:
Will depend on the load on the system and grows approximately linearly
with the length of the sequence input. Expect at least 1 minute per kb.
Longer response times can occur if the system is temporarily down
(check with the UNIX command: "finger geneid@darwin.bu.edu").

FURTHER INFORMATION:
A preprint of a paper describing the development and testing of GeneID
is available as a Stuffit.hqx file for Macintosh. Simply include the
line:

    Preprint Request

in your mail to geneid@darwin.bu.edu, and the manuscript will be mailed
to you.

REFERENCING:
Publication of output from GeneID must be referenced as follows:
(1) Guigo, R., Knudsen, S., Drake, N., and Smith, T. (1992) Prediction
    of Gene Structure. Journal of Molecular Biology 226:141-157.

PROBLEMS, COMMENTS, AND SUGGESTIONS:
Can be mailed to steen@darwin.bu.edu.

Users of the MBCRR and BMERC national computer resources have direct
online access to GeneID from their account. Contact Tom Graf at
tom@mbcrr.harvard.edu for information on these accounts.

NETGENE
________________________________________________________________________________

Users now have the option of having their submitted sequence analyzed
by NetGene also. NetGene predicts splice sites and gives information
about the likelihood of the prediction. NetGene detects both coding
regions and splice signals, and combines that information to predict
both small and large exons (it predicts one end of the exon, the
acceptor or donor site). Simply include the keyword "NetGene" between
the keyword "Genomic Sequence" and your sequence.
The results of the NetGene analysis will be mailed to you separately.
The only difference in sequence format is that NetGene will accept
sequences UP TO 100 kb. Thus, NetGene can be used in conjunction with
GeneID by first submitting a large sequence to NetGene (specify the
keyword "NetGene"; GeneID will not respond if the sequence is larger
than 20 kb). Regions that show exons with very high likelihood can then
be resubmitted to GeneID (<20 kb) for further analysis. The minimum
sequence length that NetGene will faithfully analyze is 451 bp.

REFERENCING AND FURTHER INFORMATION
Publication of output from NetGene must be referenced as follows:
(2) Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of
    Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal
    of Molecular Biology 220:49-65.

PROBLEMS, COMMENTS AND SUGGESTIONS:
Can be mailed to: steen@darwin.bu.edu

========================================================================
///////////////////////////////////////////////////////////////////////
========================================================================

GENMARK : SYSTEM FOR PREDICTING PROTEIN CODING REGIONS
Version 1.1  4/15/92
(Internet Electronic Mail Server)

GENERAL INFORMATION

GenMark is a software package available from the Georgia Tech School of
Applied Biology & Office of Information Technology for the quick
analysis of newly sequenced DNA. GenMark 1.1 is based on a special type
of Markov chain model of coding and noncoding nucleotide sequences. It
proves to be a quite sensitive indicator of protein coding regions in
E. coli and closely related species. The rate of false positive
predictions from the analysis of a 96 bp segment is about 10%; for
false negatives, about 22.5%. The process for training the program for
other species is fairly straightforward, and new species will be added
later, based on demand and available information.
GenMark is robust to the presence of ambiguities in newly sequenced
DNA: up to 10% of the sample DNA may be indicated by ambiguity symbols.

GenMark receives its submissions from your local electronic mail
service and will reply with a list of open reading frames that it
recognizes as protein coding regions. There are also various other
options, such as a PostScript(tm) graph of the results, which may be
requested. GenMark should reply by electronic mail within an hour of a
sequence's submission.

SUBMISSION OF SEQUENCES FOR ANALYSIS

Nucleotide sequences destined for processing should be sent via E-mail
to:

    genmark@ford.gatech.edu

The subject line of this message must contain one of three keywords:

    instructions
    registration
    genmark

If the subject of the message is "instructions", GenMark will reply
with the most current submission instructions and news available on the
system.

If the subject of the message is "registration", your message will be
logged in a registration roster. It is NOT necessary to register in
order to use GenMark. If you decide to register, we ask that you
include your name, your E-mail address, and a brief list of the
organisms which you would like to see supported in future versions of
GenMark (the family Enterobacteriaceae should be fairly well
represented by the E. coli information). We will keep those persons who
register informed of further developments in the software and its
options.

If the subject of the message is "genmark", the program will try to
analyze the contents of the message as sequence information. The
message should minimally have the word "data" on a line by itself,
followed by the sequence information (see below for a discussion of how
to supply options, and some example submissions).

SUPPLYING OPTIONS TO GENMARK

No options are required for GenMark to function. The options specified
below just change the manner in which the program works. Only one
option is permissible per line.
All of the options must occur before the keyword "data" and the
sequence information. ALL OF THE KEYWORDS MUST BE ENTERED IN LOWERCASE
LETTERS; the case of the sequence itself doesn't matter.

The options:

#           A comment. The rest of the line, after this symbol, is
            ignored.

address     Alternative E-mail address. After this option, include a
            valid E-mail address to which the program should send the
            output (if it is different from the address from which the
            submission was sent).

name        The name of the person who submitted the sequence. This is
            particularly important for sites where several people will
            be submitting sequences from the exact same E-mail address.
            After this option, include the name.

order       The Markov chain order to use. If you don't know what this
            is, don't mess with it. Higher is better, up to a point.
            The default is 4, though orders 1 through 5 are now
            available. After this option, include the new order.

psgraph     Give PostScript(tm) output. This instructs the program to
            include a PostScript graph of the results, which can be
            printed on any PostScript compatible printer. The page is
            divided into six horizontal panels with the probability
            function on the y-axis and the nucleotide position along
            the x-axis. The six panels represent the six different
            frames: panels 1-3 indicate frames 1-3 on the direct
            strand, and panels 4-6 indicate frames 1-3 on the
            complementary strand. Open reading frame indicators appear
            along the middle of each graph. Since there's a limit to
            the size of E-mail messages, expect the PostScript output
            to be sent as several messages.

step        Set the window step. This must be stated as a multiple of
            3 nucleotides. The default is 12. The practical upshot of
            this setting is that it allows you some freedom in
            adjusting the resolution of the PostScript(tm) graph. For
            instance, a step setting of 3 gives 4 times the resolution
            of the default of 12.

threshold   Set the open reading frame threshold.
            A number between 0 and 1 (or, as a percentage, between 0
            and 100) giving the minimum value of the probability
            function that an open reading frame must have to be
            accepted as a protein coding region. The default is 0.50.

title       The title you want to give to your PostScript(tm) graph.

window      The size of the analysis window (if you don't know what
            this is, don't play with it). The default is 96
            nucleotides, and generally 96 to 144 nucleotides works
            best.

SAMPLE SUBMISSIONS TO GENMARK

SAMPLE 1

    > mail genmark@ford.gatech.edu
    Subject: genmark
    # This example shows a minimal submission, just using the defaults set by
    # the program.
    #
    # NOTE: this will reply automatically to the exact address that it was sent
    # from with only a list of open reading frames.
    #
    # The actual DNA sequence may have any standard ambiguity DNA symbols in it
    # Anything that isn't a letter (like numbers, punctuation, spaces, carriage
    # returns) will just be ignored.
    data
    TCSSATGCATGHCATCGATWWCTCAGTCAGNA...

SAMPLE 2

    > mail genmark@ford.gatech.edu
    Subject: genmark
    # This is an example of using all of the different options.
    address biologist@college.edu
    name John Doe
    order 5
    psgraph
    step 6
    threshold 0.50
    title John Doe's New Protein Coding Region
    window 144
    data
    TCAGTTCCAAGGTTTCCCAAAGGGTTTTCCCCAAAAGGGG...

THINGS TO WATCH OUT FOR

The sendmail program used for transferring messages across the network
is limited to messages that are 64000 characters long. Therefore,
remember to send any information you might have in chunks smaller than
the 64000 character limit. The PostScript(tm) output might take up more
space than is permissible in a mail message, so GenMark will send the
graphic in parts that are smaller than 64K in length.

If you shrink the step down to 3 and send a good sized sequence, the
PostScript(tm) output will be huge, so don't be surprised. Try to
reserve doing that for smaller sequences.

For short sequences, you'll want to make the step smaller.
We suggest a step of 6 for any sequence under about 1.5 kb long, and a
step of 3 for sequences less than about 800 bases long.

Don't ask the program to make the step larger than the window. It won't
crash the program, but then again you'll probably just get garbage
back.

The sequences you send are deleted as soon as they have been processed
by the program. We cannot recover them for you.

If you do not receive a response in a couple of hours, something's
wrong. Verify the format of your submission and resend it.

The graphic response may be effective for analyzing the intron/exon
structure of eukaryotic sequences, but there are no guarantees. In such
a case, the list of open reading frames would almost certainly be
useless; only the graphic would make any sense.

In many cases, the graphic output can tell you much more about the
sequence in question than the open reading frame listing alone. Careful
evaluation of the graphic could yield clues as to sequencing errors and
frameshifts.

REFERENCES

Should you refer to the results of GENMARK analysis, you should use the
following references:

Borodovsky M. (1990) Recognition of coding regions in nucleotide
    sequences. In M.F. Frank-Kamenetskii ed. Computer Analysis of
    Genetic Texts, Nauka, Moscow.

Borodovsky M., McIninch J. Prediction of Gene Locations Using DNA
    Markov Chain Models (Submitted to CABIOS).

QUESTIONS, PROBLEMS, SUGGESTIONS

Please send any comments or questions that you might have about the
software or the method of coding region recognition to:

    mb56@hydra.gatech.edu (Mark Borodovsky)
or
    gt1619a@hydra.gatech.edu (James McIninch)
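
The GenMark option rules and step-size advice given earlier can be collected into a short Python sketch that assembles a message body. This is an illustration under stated assumptions rather than anything supplied by the GenMark authors: the function names and the 60-column sequence wrap are hypothetical, and the "about 800 bases" and "about 1.5 kb" guidelines are treated as hard cut-offs. The mail subject still has to be set to "genmark" separately.

```python
def suggested_step(length):
    """Window step suggested in the help text for a sequence of this length."""
    if length < 800:
        return 3
    if length < 1500:
        return 6
    return 12  # documented default


def build_genmark_body(sequence, order=4, step=None, window=96,
                       threshold=0.50, psgraph=False,
                       name=None, address=None, title=None):
    """Return a hypothetical GenMark message body (subject must be 'genmark')."""
    # The server ignores anything that is not a letter, so strip it up front.
    seq = "".join(ch for ch in sequence if ch.isalpha())
    if step is None:
        step = suggested_step(len(seq))
    if not 1 <= order <= 5:
        raise ValueError("order must be between 1 and 5")
    if step <= 0 or step % 3 != 0:
        raise ValueError("step must be a positive multiple of 3")
    if step > window:
        raise ValueError("step should not be larger than the window")
    if not 0 <= threshold <= 100:
        raise ValueError("threshold must lie in 0..1 or 0..100")

    # One option per line, keywords in lowercase, all before "data".
    lines = []
    if address:
        lines.append(f"address {address}")
    if name:
        lines.append(f"name {name}")
    lines.append(f"order {order}")
    if psgraph:
        lines.append("psgraph")
    lines.append(f"step {step}")
    lines.append(f"threshold {threshold}")
    if title:
        lines.append(f"title {title}")
    lines.append(f"window {window}")
    lines.append("data")
    lines.extend(seq[i:i + 60] for i in range(0, len(seq), 60))

    body = "\n".join(lines) + "\n"
    if len(body) >= 64000:
        raise ValueError("message exceeds the 64000 character mail limit")
    return body
```

Leaving `step` unset lets the length-based suggestion apply; passing it explicitly overrides the suggestion but is still checked against the window.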