home *** CD-ROM | disk | FTP | other *** search
Text File | 1996-07-05 | 7.2 KB | 172 lines | [TEXT/R*ch] |
- /* CONTIG ASSEMBLY PROGRAM (CAP)
-
- copyright (c) 1991 Xiaoqiu Huang
- The distribution of the program is granted provided no charge
- is made and the copyright notice is included.
-
- Proper attribution of the author as the source of the software
- would be appreciated:
- "A Contig Assembly Program Based on Sensitive Detection of
- Fragment Overlaps" (submitted to Genomics, 1991)
- Xiaoqiu Huang
- Department of Computer Science
- Michigan Technological University
- Houghton, MI 49931
- E-mail: huang@cs.mtu.edu
-
- ------
- call format now is:
- CAP2 file_of_fragments output_file MIN_OVERLAP PERCENT_MATCH
-
- where:
- file_of_fragments -- input data, in pearson/fasta format
- output_file -- output data, in fasta format
- MIN_OVERLAP; -- Minimum length of any overlap
- PERCENT_MATCH; -- Minimum identity percentage of any overlap
- (give percent as whole number relative to 100)
- -------
-
- The CAP program uses a dynamic programming algorithm to compute
- the maximal-scoring overlapping alignment between two fragments.
- Fragments in random orientations are assembled into contigs by a
- greedy approach in order of the overlap scores. CAP is efficient
- in computer memory: a large number of arbitrarily long fragments
- can be assembled. The time requirement is acceptable; for example,
- CAP took 4 hours to assemble 1015 fragments of a total of 252 kb
- nucleotides on a Sun SPARCstation SLC. The program is written in C
- and runs on Sun workstations.
-
- Below is a description of the parameters in the #define section of CAP.
- Two specially chosen sets of substitution scores and indel penalties
- are used by the dynamic programming algorithm: heavy set for regions
- of low sequencing error rates and light set for fragment ends of high
- sequencing error rates. (Use integers only.)
- Heavy set: Light set:
-
- MATCH = 2 MATCH = 2
- MISMAT = -6 LTMISM = -3
- EXTEND = 4 LTEXTEN = 2
-
- In the initial assembly, any overlap must be of length at least OVERLEN,
- and any overlap/containment must be of identity percentage at least
- PERCENT. After the initial assembly, the program attempts to join
- contigs together using weak overlaps. Two contigs are merged if the
- score of the overlapping alignment is at least CUTOFF. The value for
- CUTOFF is chosen according to the value for MATCH.
-
- DELTA is a parameter in necessary conditions for overlap/containment.
- Those conditions are used to quickly reject pairs of fragments that
- could not possibly have an overlap/containment relationship.
- The dynamic programming algorithm is only applied to pairs of fragments
- that pass the screening. A large value for DELTA means stringent
- conditions, where the value for DELTA is a real number at least 8.0.
-
- POS5 and POS3 are fragment positions such that the 5' end between base 1
- and base POS5, and the 3' end after base POS3 are of high sequencing
- error rates, say more than 5%. For mismatches and indels occurring in
- the two ends, light penalties are used.
-
- A file of input fragments looks like:
- >G019uabh
- ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAA
- GTCTTGCTTGAATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTAC
- TCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTACAGTAG
- GACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTT
- AATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTT
- ATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
- AGTCTTGTTACGTTATGACTAATCTTTGGGGATTGTGCAGAATGTTATTT
- TAGATAAGCAAAACGAGCAAAATGGGGAGTTACTTATATTTCTTTAAAGC
- >G028uaah
- CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTT
- TAAACACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGAT
- TGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGC
- TGGCAGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGC
- ATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCT
- TCCCCATCCCATCAGTCT
- >G022uabh
- TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTG
- TAGGTGATTGGGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCC
- CATTAAAACCCTTTATGCCCATACATCATAACACTACTTCCTACCCATAA
- GCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAAC
- ACAAAATTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATT
- GATTGAT
- >G023uabh
- AATAAATACCAAAAAAATAGTATATCTACATAGAATTTCACATAAAATAA
- ACTGTTTTCTATGTGAAAATTAACCTAAAAATATGCTTTGCTTATGTTTA
- AGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAATAATCCTCTAC
- GATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATGGAAATAAAAT
- GTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACATGAAATGCTTT
- TTAAAAGAAAATATTAAAGTTAAACTCCCCTATTTTGCTCGTTTTTGCTT
- ATCTAAAATACATTCTGCACAATCCCCAAAGATTGATCATACGTTAC
- >G006uaah
- ACATAAAATAAACTGTTTTCTATGTGAAAATTAACCTANNATATGCTTTG
- CTTATGTTTAAGATGTCATGCTTTTTATCAGTTGAGGAGTTCAGCTTAAT
- AATCCTCTAAGATCTTAAACAAATAGGAAAAAAACTAAAAGTAGAAAATG
- GAAATAAAATGTCAAAGCATTTCTACCACTCAGAATTGATCTTATAACAT
- GAAATGCTTTTTAAAAGAAAATATTAAAGTTAAACTCCCC
- A string after ">" is the name of the following fragment.
- Only the five upper-case letters A, C, G, T and N are allowed
- to appear in fragment data. No other characters are allowed.
- A common mistake is the use of lower case letters in a fragment.
-
- To run the program, type a command of form
-
- cap file_of_fragments
-
- The output goes to the terminal screen. So redirection of the
- output into a file is necessary. The output consists of three parts:
- overview of contigs at fragment level, detailed display of contigs
- at nucleotide level, and consensus sequences.
- '+' = direct orientation; '-' = reverse complement
- The output of CAP on the sample input data looks like:
-
- #Contig 1
-
- #G022uabh+(0)
- TATTTTAGAGACCCAAGTTTTTGACCTTTTCCATGTTTACATCAATCCTGTAGGTGATTG
- GGCAGCCATTTAAGTATTATTATAGACATTTTCACTATCCCATTAAAACCCTTTATGCCC
- ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
- AATTAAAGACTTGTTTAAACACAAAA-TTTAGACTTTTACTCAACAAAAGTGATTGATTG
- ATTGATTGATTGATTGAT
- #G028uaah+(145)
- CATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTGAATTAAAGACTTGTTTAAACACAAA
- A-TTTAGACTTTTACTCAACAAAAGTGATTGATTGATTGATTGATTGATTGATGGTTTAC
- AGTAGGACTTCATTCTAGTCATTATAGCTGCTGGCAGTATAACTGGCCAGCCTTTAATAC
- ATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTTGGTATGATTTATCTTTTTGGTCTTC
- TATAGCCTCCTTCCCCATCCC-ATCAGTCT
- #G019uabh+(120)
- ATACATCATAACACTACTTCCTACCCATAAGCTCCTTTTAACTTGTTAAAGTCTTGCTTG
- AATTAAAGACTTGTTTAAACACAAAAATTTAGAGTTTTACTCAACAAAAGTGATTGATTG
- ATTGATTGATTGATTGATGGTTTACAGTAGGACTTCATTCTAGTCATTATAGCTGCTGGC
- AGTATAACTGGCCAGCCTTTAATACATTGCTGCTTAGAGTCAAAGCATGTACTTAGAGTT
- GGTATGATTTATCTTTTTGGTCTTCTATAGCCTCCTTCCCCATCCCCATCAGTCTTAATC
- AGTCTTGTTACGTTATGACT-AATCTTTGGGGATTGTGCAGAATGTTATTTTAGATAAGC
- AAAA-CGAGCAAAAT-GGGGAGTT-A-CTT-A-TATTT-CTTT-AAA--GC
- #G023uabh-(426)
- GTAACGT-ATGA-TCAATCTTTGGGGATTGTGCAGAATGT-ATTTTAGATAAGCAAAAAC
- GAGCAAAATAGGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAG
- ATCAATTCTGAGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTT
- TTTCCTATTTGTTTAAGATCGTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAG
- CATGACATCTTAAACATAAGCAAAGCATATTTTTAGGTTAATTTTCACATAGAAAACAGT
- TTATTTTATGTGAAATTCTATGTAGATATACTATTTTTTTGGTATTTATT
- #G006uaah-(496)
- GGGGAGTTTAACTTTAATATTTTCTTTTAAAAAGCATTTCATGTTATAAGATCAATTCTG
- AGTGGTAGAAATGCTTTGACATTTTATTTCCATTTTCTACTTTTAGTTTTTTTCCTATTT
- GTTTAAGATCTTAGAGGATTATTAAGCTGAACTCCTCAACTGATAAAAAGCATGACATCT
- TAAACATAAGCAAAGCATATNNT-AGGTTAATTTTCACATAGAAAACAGTTTATTTTATG
- T
-
-
-
- Slight modifications by S. Smith on Mon Feb 17 10:18:34 EST 1992.
- These changes allow for command line arguements for several
- of the hard coded parameters, as well as a slight modification to
- the output routine to support GDE format. Changes are commented
- as:
-
- Mod by S.S.
-
-
- more mods for use with Macintosh SeqApp program, d.g. gilbert, June 93
- */
-