- Path: sparky!uunet!europa.asd.contel.com!howland.reston.ans.net!usc!wupost!spool.mu.edu!agate!stanford.edu!enterpoop.mit.edu!eru.mt.luth.se!lunic!sunic!dkuug!uts!engeje
- From: engeje@uts.uni-c.dk (Jacob Engelbrecht)
- Newsgroups: bionet.software
- Subject: Re: intron/exon borders
- Message-ID: <1993Jan2.212630.683@uts.uni-c.dk>
- Date: 2 Jan 93 21:26:30 GMT
- References: <30DEC199214434286@aardvark.ucs.uoknor.edu>
- Organization: UNI-C, Danish Computing Centre for Research and Education
- Lines: 257
-
- In <30DEC199214434286@aardvark.ucs.uoknor.edu> bfrank@aardvark.ucs.uoknor.edu (FRANK,BART) writes:
-
- >Can anyone recommend a good program to screen human genomic sequences
- >and predict positions of intron/exon borders?
-
- >Thanks,
- >Bart Frank
- >Internet: BFRANK@AARDVARK.UCS.UOKNOR.EDU
-
- Below is information about a mail server for prediction of splice
- sites, as described in our Journal of Molecular Biology article.
-
-
- ******** Announcement of the NetGene Mail-server: *********
-
- DESCRIPTION:
-
- The NetGene mail server is a service producing neural network
- predictions of splice sites in vertebrate genes as described in:
- Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of
- Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal
- of Molecular Biology, 220, 49-65.
-
-
- ABSTRACT OF JMB ARTICLE:
-
- Artificial neural networks have been applied to the prediction of
- splice site location in human pre-mRNA. A joint prediction scheme
- where prediction of transition regions between introns and exons
- regulates a cutoff level for splice site assignment was able to
- predict splice site locations with confidence levels far better than
- previously reported in the literature. The problem of predicting
- donor and acceptor sites in human genes is hampered by the presence
- of numerous amounts of false positives - in the paper the
- distribution of these false splice sites is examined and linked to a
- possible scenario for the splicing mechanism in vivo. When the
- presented method detects 95% of the true donor and acceptor sites it
- makes less than 0.1% false donor site assignments and less than 0.4%
- false acceptor site assignments. For the large data set used in this
- study this means that on the average there are one and a half false
- donor sites per true donor site and six false acceptor sites per true
- acceptor site. With the joint assignment method more than a fifth of
- the true donor sites and around one fourth of the true acceptor sites
- could be detected without accompaniment of any false positive
- predictions. Highly confident splice sites could not be isolated
- with a widely used weight matrix method or by separate splice site
- networks. A complementary relation between the confidence levels of
- the coding/non-coding and the separate splice site networks was
- observed, with many weak splice sites having sharp transitions in the
- coding/non-coding signal and many stronger splice sites having more
- ill-defined transitions between coding and non-coding.
-
-
- INSTRUCTIONS:
-
- In order to use the NetGene mail-server:
-
- 1) Prepare a file with the sequence in a format similar to the fasta
- format: the first line must start with the symbol '>', the next
- word on that line is used as the sequence identifier. The
- following lines should contain the actual sequence, consisting of
- the symbols A, T, U, G, C and N. U is converted to T, letters not
- mentioned are converted to N. All letters are converted to upper
- case. Numbers, blanks and other nonletter symbols are skipped.
- The lines should not be longer than 80 characters. The minimum
- length analyzed is 451 nucleotides, and the maximum is 100000
- nucleotides (your mail system may have a lower limit for the
- maximum size of a message). Due to the non-local nature of the
- algorithm, sites closer than 225 nucleotides to the ends of the
- sequence will not be assigned.
-
- 2) Mail the file to netgene@virus.fki.dth.dk. The response time will
- depend on system load. If nothing else is running on the machine,
- the speed is about 1000 nucleotides/min. It may take several
- hours before you get the answer, so please do not resubmit a job
- if you get no answer within a short while.
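
The formatting rules in step 1 can be sketched in a few lines of Python. This is only an illustration of the rules as stated above, not the server's own code: the function names (`normalize`, `to_netgene_message`, `best_case_minutes`) and the default line width of 70 are assumptions made for this example, and the length check is applied after cleanup, which is a guess about how the server counts.

```python
import re

MIN_LEN = 451       # minimum length analyzed, per the instructions
MAX_LEN = 100_000   # maximum length accepted, per the instructions


def normalize(raw: str) -> str:
    """Apply the cleanup rules from step 1 to a raw sequence string."""
    letters = re.sub(r"[^A-Za-z]", "", raw).upper()  # skip non-letters, upper-case
    letters = letters.replace("U", "T")              # U is converted to T
    return re.sub(r"[^ATGCN]", "N", letters)         # any other letter becomes N


def to_netgene_message(identifier: str, raw: str, width: int = 70) -> str:
    """Build the '>' header line plus wrapped sequence lines (hypothetical helper)."""
    if not identifier or any(ch.isspace() for ch in identifier):
        raise ValueError("identifier must be a single word")
    if not 1 <= width <= 80:
        raise ValueError("lines should not be longer than 80 characters")
    seq = normalize(raw)
    # Assumption: the length limits apply to the cleaned-up sequence.
    if not MIN_LEN <= len(seq) <= MAX_LEN:
        raise ValueError(f"sequence length {len(seq)} outside {MIN_LEN}..{MAX_LEN}")
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{identifier}\n" + "\n".join(lines) + "\n"


def best_case_minutes(length: int) -> float:
    """Lower bound on run time at about 1000 nucleotides/min on an idle machine."""
    return length / 1000
```

Writing the returned string to a file such as test.seq gives something that can be mailed as described in step 2. For a sequence of 6953 bases, `best_case_minutes` gives roughly 7 minutes, which is a best case only; the actual response depends on system load.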
-
-
- REFERENCING AND FURTHER INFORMATION:
-
- Publication of output from NetGene must be referenced as follows:
- Brunak, S., Engelbrecht, J., and Knudsen, S. (1991) Prediction of
- Human mRNA Donor and Acceptor Sites from the DNA Sequence. Journal
- of Molecular Biology, 220, 49-65.
-
-
- CONFIDENTIALITY:
-
- Your submitted sequence will be deleted automatically immediately
- after processing by NetGene.
-
-
- PROBLEMS AND SUGGESTIONS:
-
- Should be addressed to:
-
- Jacob Engelbrecht
-
- e-mail: engel@virus.fki.dth.dk
-
- Department of Physical Chemistry
- The Technical University of Denmark
- Building 206
- DK-2800 Lyngby
- Denmark
-
- phone: +45 4288 2222 ext. 2478 (operator)
- phone: +45 4593 1222 ext. 2478 (tone)
- fax: +45 4593 4808
-
-
- EXAMPLE:
-
- A file test.seq is prepared with an editor with the following contents:
-
- >HUMOPS
- GGATCCTGAGTACCTCTCCTCCCTGACCTCAGGCTTCCTCCTAGTGTCACCTTGGCCCCTCTTAGAAGC
- CAATTAGGCCCTCAGTTTCTGCAGCGGGGATTAATATGATTATGAACACCCCCAATCTCCCAGATGCTG
- . Here come more lines with sequence.
- .
- .
-
- On a Unix system, the file is sent to the NetGene mail-server like this:
- mail netgene@virus.fki.dth.dk < test.seq
-
- In return an answer similar to this is produced:
-
- From netgene@virus.fki.dth.dk Fri Mar 20 13:30 MET 1992
- Received: by virus.fki.dth.dk
- (16.7/16.2) id AA05624; Fri, 20 Mar 92 13:30:41 +0100
- Date: Fri, 20 Mar 92 13:30:41 +0100
- From: virus mail server <netgene@virus.fki.dth.dk>
- Return-Path: <netgene@virus.fki.dth.dk>
- To: engel@virus.fki.dth.dk
- Subject: HUMOPS: NetGene splice site prediction
- Status: RO
-
-
- ------------------------------------------------------------------------
- NetGene
- Neural Network Prediction of Splice Sites
-
- Reference:
- Brunak, S., Engelbrecht, J., and Knudsen, S. (1991). Prediction of
- Human mRNA donor and acceptor sites from the DNA sequence. Journal of
- Molecular Biology 220:49-65.
- ------------------------------------------------------------------------
-
- Report ERRORS to Jacob Engelbrecht engel@virus.fki.dth.dk.
-
- Potential splice sites are assigned by combining output from a local and
- a global network. The prediction is made with two cutoffs: 1) Highly
- confident sites (no or few false positives, on average 50% of the true
- sites detected); 2) Nearly all true sites (more false positives - on
- average of all positions 0.1% false positive donor sites and 0.4% false
- positive acceptor sites, at 95% detection of true sites). The network
- performance on sequences from distantly related organisms has not been
- quantified. Due to the non-local nature of the algorithm sites closer
- than 225 nucleotides to the ends of the sequence cannot be assigned.
-
-
-
- Column explanations, field identifiers:
-
- POSITION in your sequence (either first or last base in intron).
- Joint CONFIDENCE level for the site (relative to the cutoff).
- EXON INTRON gives 20 bases of sequence around the predicted site.
- LOCAL is the site confidence from the local network.
- GLOBAL is the site confidence from the global network.
-
- ------------------------------------------------------------------------
- The sequence: HUMOPS contains 6953 bases, and has the following composition:
- A 1524 C 2022 G 1796 T 1611
-
-
- 1) HIGHLY CONFIDENT SITES:
- ==========================
-
- ACCEPTOR SITES:
- POSITION CONFIDENCE INTRON EXON LOCAL GLOBAL
- 4094 0.27 TGTCCTGCAG^GCCGCTGCCC 0.63 0.66
- 5167 0.20 TGCCTTCCAG^TTCCGGAACT 0.59 0.64
- 3812 0.17 CTGTCCTCAG^GTACATCCCC 0.68 0.54
- 3164 0.02 TCCTCCTCAG^TCTTGCTAGG 0.79 0.32
- 2438 0.01 TGCCTTGCAG^GTGAAATTGC 0.78 0.33
-
- DONOR SITES:
- POSITION CONFIDENCE EXON INTRON LOCAL GLOBAL
- 3979 0.38 CGTCAAGGAG^GTACGGGCCG 0.92 0.74
- 2608 0.17 GCTGGTCCAG^GTAATGGCAC 0.85 0.54
- 4335 0.06 GAACAAGCAG^GTGCCTACTG 0.83 0.41
-
-
- 2) NEARLY ALL TRUE SITES:
- =========================
-
- ACCEPTOR SITES:
- POSITION CONFIDENCE INTRON EXON LOCAL GLOBAL
- 4094 0.55 TGTCCTGCAG^GCCGCTGCCC 0.63 0.66
- 3812 0.52 CTGTCCTCAG^GTACATCCCC 0.68 0.54
- 3164 0.49 TCCTCCTCAG^TCTTGCTAGG 0.79 0.32
- 5167 0.49 TGCCTTCCAG^TTCCGGAACT 0.59 0.64
- 2438 0.48 TGCCTTGCAG^GTGAAATTGC 0.78 0.33
- 4858 0.39 TCATCCATAG^AAAGGTAGAA 0.77 0.20
- 3712 0.36 CCTTTTCCAG^GGAGGGAATG 0.88 -0.01
- 4563 0.33 CCCTCCACAG^GTGGCTCAGA 0.81 0.05
- 5421 0.33 TTTTTTTAAG^AAATAATTAA 0.75 0.13
- 3783 0.29 TCCCTCACAG^GCAGGGTCTC 0.64 0.26
- 3173 0.25 GTCTTGCTAG^GGTCCATTTC 0.52 0.36
- 4058 0.24 CTCCCTGGAG^GAGCCATGGT 0.43 0.51
- 1784 0.22 TCACTGTTAG^GAATGTCCCA 0.68 0.08
- 6512 0.21 CCCTTGCCAG^ACAAGCCCAT 0.67 0.08
- 2376 0.20 CCCTGTCTAG^GGGGGAGTGC 0.61 0.16
- 1225 0.18 CCCCTCTCAG^CCCCTGTCCT 0.65 0.07
- 1743 0.13 TTCTCTGCAG^GGTCAGTCCC 0.62 0.03
- 3834 0.13 GGGCCTGCAG^TGCTCGTGTG 0.26 0.58
- 4109 0.13 TGCCCAGCAG^CAGGAGTCAG 0.29 0.54
- 6557 0.13 CATTCTGGAG^AATCTGCTCC 0.56 0.12
- 1638 0.11 CCATTCTCAG^GGAATCTCTG 0.62 0.00
- 247 0.10 GCCTTCGCAG^CATTCTTGGG 0.55 0.11
- 6766 0.09 CTATCCACAG^GATAGATTGA 0.64 -0.06
- 906 0.08 AATTTCACAG^CAAGAAAACT 0.61 -0.02
- 6499 0.08 CAGTTTCCAG^TTTCCCTTGC 0.55 0.06
- 378 0.07 GTACCCACAG^TACTACCTGG 0.24 0.52
- 3130 0.07 CTGTCTCCAG^AAAATTCCCA 0.51 0.12
- 4272 0.07 ACCATCCCAG^CGTTCTTTGC 0.58 0.00
- 4522 0.07 TGAATCTCAG^GGTGGGCCCA 0.51 0.12
- 5722 0.07 ACCCTCGCAG^CAGCAGCAAC 0.55 0.05
- 2316 0.06 CTTCCCCAAG^GCCTCCTCAA 0.40 0.27
- 2357 0.06 GCCTTCCTAG^CTACCCTCTC 0.39 0.28
- 2908 0.06 TTTGGTCTAG^TACCCCGGGG 0.51 0.10
- 4112 0.06 CCAGCAGCAG^GAGTCAGCCA 0.25 0.50
- 1327 0.05 TTTGCTTTAG^AATAATGTCT 0.52 0.06
- 844 0.04 GTTTGTGCAG^GGCTGGCACT 0.62 -0.11
- 1045 0.04 TCCCTTGGAG^CAGCTGTGCT 0.54 0.01
- 1238 0.03 CTGTCCTCAG^GTGCCCCTCC 0.50 0.06
- 2976 0.03 CCTAGTGCAG^GTGGCCATAT 0.62 -0.12
- 3825 0.03 CATCCCCGAG^GGCCTGCAGT 0.16 0.60
- 1508 0.02 TGAGATGCAG^GAGGAGACGC 0.43 0.16
- 2257 0.02 CTCTCCTCAG^CGTGTGGTCC 0.53 0.00
- 5712 0.02 ATCCTCTCAG^ACCCTCGCAG 0.51 0.05
- 2397 0.00 CCCTCCTTAG^GCAGTGGGGT 0.41 0.16
- 4800 0.00 CATTTTCTAG^CTGTATGGCC 0.47 0.07
- 5016 0.00 TGCCTAGCAG^GTTCCCACCA 0.59 -0.11
-
- DONOR SITES:
- POSITION CONFIDENCE EXON INTRON LOCAL GLOBAL
- 3979 0.75 CGTCAAGGAG^GTACGGGCCG 0.92 0.74
- 2608 0.51 GCTGGTCCAG^GTAATGGCAC 0.85 0.54
- 4335 0.38 GAACAAGCAG^GTGCCTACTG 0.83 0.41
- 656 0.32 ACCCTGGGCG^GTATGAGCCG 0.56 0.66
- 5859 0.11 ACCAAAAGAG^GTGTGTGTGT 0.85 0.07
- 4585 0.09 GCTCACTCAG^GTGGGAGAAG 0.86 0.03
- 1708 0.06 TGGCCAGAAG^GTGGGTGTGC 0.85 0.01
- 6196 0.05 CCCAATGAGG^GTGAGATTGG 0.86 -0.01
- 667 0.03 TATGAGCCGG^GTGTGGGTGG 0.23 0.71
-
- ------------------------------------------------------------------------
-
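
For anyone who wants to post-process a reply like the one above, the site tables can be read with a small parser. The following Python sketch assumes the reply has exactly the layout shown in this example (section headings in upper case, five whitespace-separated columns per site); the `Site` record, the `parse_report` function and the cutoff labels "high" and "all" are names invented for this illustration and are not part of NetGene.

```python
import re
from typing import List, NamedTuple


class Site(NamedTuple):
    cutoff: str        # "high" or "all" (labels chosen for this sketch)
    kind: str          # "acceptor" or "donor"
    position: int      # POSITION column
    confidence: float  # joint CONFIDENCE, relative to the cutoff
    context: str       # 20 bases, with '^' marking the predicted site
    local: float       # LOCAL network confidence
    global_: float     # GLOBAL network confidence


# One table row: position, confidence, flanking sequence, local, global.
ROW = re.compile(
    r"(\d+)\s+(-?\d+\.\d+)\s+([ACGTN]+\^[ACGTN]+)\s+(-?\d+\.\d+)\s+(-?\d+\.\d+)\s*$"
)


def parse_report(text: str) -> List[Site]:
    """Collect site rows, tracking which section heading they appear under."""
    sites = []
    cutoff = kind = None
    for line in text.splitlines():
        if "HIGHLY CONFIDENT SITES" in line:
            cutoff = "high"
        elif "NEARLY ALL TRUE SITES" in line:
            cutoff = "all"
        elif "ACCEPTOR SITES:" in line:
            kind = "acceptor"
        elif "DONOR SITES:" in line:
            kind = "donor"
        else:
            m = ROW.search(line)
            if m and cutoff and kind:
                pos, conf, ctx, loc, glob = m.groups()
                sites.append(Site(cutoff, kind, int(pos), float(conf),
                                  ctx, float(loc), float(glob)))
    return sites
```

Given the reply text as a string, `[s for s in parse_report(reply) if s.cutoff == "high" and s.kind == "donor"]` would list the highly confident donor sites. Because the pattern is searched for anywhere in the line, quoted or prefixed copies of the reply (as in a news posting) are handled as well.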