- INSTRUCTION MANUAL for ABaCUS (Analysis of Blake's Conjecture Using
- Simulations) by Arlin Stoltzfus and David Spencer. Manual version 0.48, 5
- July 1994 (by A. Stoltzfus)
-
- ==========================================================================
- CONTENTS:
- ==========================================================================
-
- 0. HOW TO USE THIS INSTRUCTION MANUAL
-
- I. ABOUT ABACUS
-
- A. BASIC DESCRIPTION
- B. HARDWARE AND SOFTWARE REQUIREMENTS FOR THE PRE-COMPILED VERSION
- C. COMPILING THE ABACUS CODE FOR ANOTHER ENVIRONMENT
- D. CITING ABACUS IN PUBLISHED WORK; PROPRIETARY CLAIMS
-
- II. TUTORIAL: AN ANALYSIS OF CYTOCHROME C DATA
-
- A. STAGE 1: PREPARATION OF THE CYTOCHROME C DATA
- B. STAGE 2: CREATING THE NECESSARY DATA FILES
- C. STAGE 3: ANALYZING CORRESPONDENCES WITH THE CYTOCHROME C DATA SET
-
- III. GENERAL STEPWISE INSTRUCTIONS
-
- A. STAGE 1: COLLATE THE DATA PRIOR TO USING ABACUS
- B. STAGE 2: ENTER THE OBSERVED DATA AND SAVE THEM TO FILES
- C. STAGE 3: EVALUATE CORRESPONDENCES
-
- IV. DETAILED COMMENTS
-
- A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS
- B. ARRAYS: CREATING, CONVERTING, SAVING AND LOADING
- C. BE CAREFUL WHEN ENTERING DATA
- D. LOADING ATOMIC COORDINATES FROM A PDB FILE
- E. GENERATING REFERENCE GENE DATA
- F. SCORING CORRESPONDENCES
- G. EVALUATING THE SIGNIFICANCE OF A CORRESPONDENCE
- H. SAVING RESULTS; FURTHER ANALYSIS OF SCORES; etc
- I. PLOTTING DIAGONAL PLOTS AND EXON PLOTS
-
- V. ADDITIONAL DETAILS
-
- A. HARD LIMITS ON PARAMETERS
- B. THE RANDOM NUMBER GENERATOR
- C. EXPLANATION OF THE SETTINGS MENU
- D. HOW TO CONTACT THE PDB
-
- VI. REFERENCES
-
- ==========================================================================
- 0. HOW TO USE THIS INSTRUCTION MANUAL
- ==========================================================================
-
- 0.A. IF YOU DON'T HAVE AN EXECUTABLE PROGRAM. Try out the DOS or Sun
- executable or read section I below to be sure that ABaCUS can solve the
- type of problem that you are interested in. If so, read section I.C.
- below, then open the header file "abacus.h" with a text editor. Read the
- instructions therein, make the necessary changes, and proceed. You may
- also want to check out section V.A. to help in tailoring ABaCUS to your
- needs.
-
- 0.B. IF YOU ALREADY HAVE AN EXECUTABLE PROGRAM. First, read section I
- (short). Next, be sure that the executable ("abacus.exe" in DOS and
- "abacus" in SunOS) and the data file "pdb1ccrs.txt" are on your hard drive,
- in the same directory. For DOS users who wish to use graphics, be sure to
- include the graphics interface file (with the ".bgi" extension) appropriate
- for your hardware. Then read and perform the tutorial exercises, using
- ABaCUS. Plan on spending 30-60 minutes on the tutorial. This should be
- enough to familiarize you with the steps involved in preparing and
- analyzing data.
-
- 0.C. IF YOU'RE NOT SURE ABOUT SOMETHING. This document represents a large
- amount of work devoted to explaining how ABaCUS works and how to use it.
- Please consult this manual for explanations of how data are handled and how
- operations are carried out. For questions about the meaning of statistical
- results of simulations, ask your local statistical consultant. For
- bit-twiddly questions, consult the code, which is heavily commented. For
- questions about the interpretation of results in the context of the
- evolution of introns, a good place to start is the general review by
- Doolittle (1987). Also, see Gilbert and Glynias (1994) and Stoltzfus, et
- al. (1994). As a last recourse, ask the authors for help, preferably by
- E-mail, at one of the addresses listed below. If you are carrying out a
- research project involving correspondences between split gene structure and
- protein structure, we would be happy to hear about it, even if you don't
- have any questions, and even if you don't find any correspondences.
-
- Dr. Arlin Stoltzfus and Dr. David Spencer
- Canadian Institute for Advanced Research
- Program in Evolutionary Biology
- Department of Biochemistry
- Dalhousie University
- Halifax, Nova Scotia B3H 4H7 CANADA
-
- internet: arlin@ac.dal.ca
- phone: 902-494-3569
- facsimile: 902-494-1355
-
-
- ==========================================================================
- I. ABOUT ABACUS
- ==========================================================================
-
- I.A. BASIC DESCRIPTION
-
- ABaCUS is a no-frills program to investigate the significance of the
- putative correspondence between exons and units of protein structure. This
- type of analysis takes the form of an attempt to eliminate the reference
- hypothesis (sometimes called a "null" hypothesis) that no correspondence
- exists. A reference hypothesis in this case consists of a reference model
- for random gene structures, and a scoring rule for quantifying
- correspondences (in principle, a test could be done by generating random
- protein structures instead of random gene structures, but this is
- impracticable). ABaCUS creates and reads files containing observed data
- supplied by the user, then uses this information to generate reference
- genes according to one of several available models. The observed and
- reference genes are then scored according to a correspondence rule
- designated by the user, and the scores are compared in order to determine
- whether the reference hypothesis (i.e., no correspondence) can be rejected.
-
- I.B. HARDWARE AND SOFTWARE REQUIREMENTS FOR THE PRE-COMPILED DOS VERSION
-
- The compiled program "ABaCUS.exe" runs in DOS. The minimal DOS platform is
- a 286-based PC-compatible computer with a monochrome monitor. Monochrome
- or color graphics are possible (drivers are provided for EGA, VGA, CGA and
- Hercules). If you are not sure which driver to use, just include all of
- the drivers in the same directory (ABaCUS will automatically use the
- correct one). There is also a precompiled SunOS version, which does not
- have graphics and thus requires no additional files.
-
- I.C. COMPILING THE ABACUS CODE FOR ANOTHER ENVIRONMENT
-
- All of the important parts of ABaCUS are portable to non-DOS environments.
- The graphics portion-- which is available only in the DOS environment, and
- is dependent on the Borland graphics library-- is interesting but not
- central to the task of hypothesis-testing. An ANSI-C-compliant version of
- ABaCUS has been compiled and run in BSD UNIX (using the Gnu C compiler
- 2.4.0 on a Sun running SunOS 4.1.2; also on a NeXT). To compile ABaCUS,
- one needs the main code block "abacus.c" and the header file "abacus.h".
- All alterations are made within the header file, which contains
- instructions for conditional compilation.
-
- If you have gotten an ABaCUS package from an Internet server, the ".readme"
- file associated with the package will give further information on
- compilation for specific environments.
-
- I.D. CITING ABACUS IN PUBLISHED WORK; PROPRIETARY CLAIMS
-
- A manuscript describing ABaCUS is in preparation (Stoltzfus and Spencer,
- 1998). For now, please cite "A. Stoltzfus and David Spencer, personal
- communication" as the source of ABaCUS, and refer to Stoltzfus, et al.
- (1994) for its use in analyzing correspondences.
-
- Because ABaCUS is a scientific application designed to aid in resolving a
- biological question, it is available to the general public. The code has
- no copyright at present, and may be distributed freely. We encourage
- interested biologists to analyze their data using ABaCUS, and to report the
- results (whether positive or negative) in trade journals. We would be
- delighted to receive a preprint or reprint of any manuscript describing
- analyses performed with ABaCUS.
-
-
- ==========================================================================
- II. TUTORIAL: AN ANALYSIS OF CYTOCHROME C DATA
- ==========================================================================
-
- An analysis falls into three stages:
-
- A. gathering and collating observed data;
-
- B. creating data files using ABaCUS;
-
- C. evaluating correspondences using ABaCUS.
-
- The user must supply the data (sequence information) and the tools (e.g.,
- an alignment program) to collate it. ABaCUS provides the remaining
- accounting and computational tools. Once the data are prepared, analyses
- can be carried out in a single session lasting from a few minutes to a few
- hours (depending on the complexity of the case and the computing power
- available). The operations involved in each stage of a typical analysis
- are described in the tutorial and in section III below.
-
- II.A. STAGE 1: PREPARATION OF THE CYTOCHROME C DATA
-
- The data have already been prepared, as follows.
-
- II.A.1. Protein structure. The structure of rice cytochrome C in the file
- named "pdb1ccr.ent" was chosen (arbitrarily) from among three cytochrome C
- structures at the Brookhaven PDB that have a very fine resolution, of 1.5
- Angstroms. In addition to atomic coordinates, the file pdb1ccr.ent
- contains a list of the boundaries of alpha-helices (there are no
- beta-strands in cytochrome C).
-
- II.A.2. Intron-containing sequences. Kemmerer, et al. (1991a, 1991b) listed
- a total of 5 distinct intron positions found in cytochrome C genes of rice,
- Drosophila, Arabidopsis, human, chicken, and mouse. A search for
- additional distantly related intron-containing sequences in GenBank yielded
- one gene, from Aspergillus nidulans, containing two intron positions
- (Raitt, et al., 1994). Alignments of the inferred amino acid sequences of
- all of these intron-containing genes indicate that there are a total of 6
- distinct intron positions, which can be represented in a minimal set of
- four sequences, from Arabidopsis, rice, chicken, and Aspergillus. It is
- possible that this set does not represent all currently known distinct
- intron positions, since there are literally hundreds of cytochrome C
- sequences in GenBank, and my search procedure did not involve screening
- each entry for potentially novel intron positions.
-
- II.A.3. Alignment with reference protein. The complete rice sequence
- (corresponding to the crystal structure) contains 111 residues, but only
- the latter 103 residues align with other cytochrome C sequences. Therefore,
- a text editor was used to delete data for the first 8 residues: the
- resulting shortened file is called "pdb1ccrs.txt". This file has been
- included with the ABaCUS package. The positions of the 6 introns relative
- to the canonical-length sequence of rice cytochrome C are:
-
-
- source         intron
- taxon          position
-
- Arabidopsis    12-0
- rice           29-1
- animals        56-1
- Arab., Asp.    65-0
- rice           74-0
- Aspergillus    96-2
-
- The positions of alpha-helices relative to the canonical-length sequence of
- rice cytochrome C are:
-
-                left & right
- structure      boundaries (inclusive)
-
- helix1         2 to 14
- helix2         49 to 55
- helix3         60 to 69
- helix4         70 to 75
- helix5         87 to 103
-
-
- II.B. STAGE 2: CREATING THE NECESSARY DATA FILES
-
- II.B.1. Enter the observed intron positions. Enter the size of the gene
- as 103 codons and the number of introns as 6. Then input the numbers in
- the table of intron positions above. When entering the intron positions,
- separate the codon and phase using one or a few spaces. Use the "v=VIEW"
- command to see the intron positions. The console should look like this:
-
- OBS: 33 85 166 192 219 287
- SCORE: 0.0 0.0 0.0 0.0 0.0 0.0
-
- This means that the first intron is after the 33rd coding nucleotide of the
- canonical-length mRNA, that is, the 33rd inter-nucleotide site (an mRNA of
- N nucleotides has N-1 possible intron positions, or inter-nucleotide
- sites). If the intron positions entered were correct, then save them to a
- file named "cytobs.int" (short for "cytochromeC observed introns").
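-
- The conversion that produces the OBS line can be sketched as follows. This
- is a minimal illustration, not the ABaCUS code: it applies the formula
- given in section IV.A.1 (position = 3 * (codon - 1) + phase) to the
- tutorial table, and the function name is hypothetical.
-
```python
# Intron positions from the tutorial table, as (codon, phase) pairs.
OBSERVED = [(12, 0), (29, 1), (56, 1), (65, 0), (74, 0), (96, 2)]


def to_nucleotide_scale(codon, phase):
    """Return the number of the coding nucleotide that precedes the intron."""
    if codon < 1 or phase not in (0, 1, 2):
        raise ValueError("codon must be >= 1 and phase must be 0, 1 or 2")
    return 3 * (codon - 1) + phase


positions = [to_nucleotide_scale(codon, phase) for codon, phase in OBSERVED]
print("OBS:", *positions)  # prints: OBS: 33 85 166 192 219 287
```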
-
- If the intron positions entered were correct, and the number of codons
- entered was correct, then ABaCUS has also created a correct set of exon
- sizes. The set should look like this:
-
- OBS: 11 18 27 8 9 23 7
- SCORE: 0.0 0.0 0.0 0.0 0.0 0.0 0.0
-
- This means that the first 11 residues of the protein are assigned to the
- first exon, the next 18 to the second exon, and so on. Notice that there
- are 7 exon sizes for 6 intron positions, and that exon sizes are in codons
- (or residues), while intron positions are on a nucleotide scale. If the
- exon sizes are correct, save the exon sizes to a file called "cythyp.exn"
- (short for "cytochrome hypothetical exons").
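-
- The relation between the two displays can be sketched as below. The rule
- used here is an assumption inferred from the tutorial numbers, not taken
- from the ABaCUS code: a residue is assigned to the exon that contains the
- first nucleotide of its codon. The function name is hypothetical; see
- section IV.A.2 for how ABaCUS itself handles exon sizes.
-
```python
def exon_sizes(intron_positions, n_codons):
    """Exon sizes in codons from intron positions on the nucleotide scale."""
    # Assumed rule: the number of residues upstream of an intron at
    # nucleotide position p is ceil(p / 3), written with integer arithmetic.
    boundaries = [(p + 2) // 3 for p in sorted(intron_positions)]
    edges = [0] + boundaries + [n_codons]
    return [right - left for left, right in zip(edges, edges[1:])]


sizes = exon_sizes([33, 85, 166, 192, 219, 287], 103)
print("OBS:", *sizes)  # prints: OBS: 11 18 27 8 9 23 7
```
-
- Under this rule, two introns that are not separated by the first nucleotide
- of any codon (such as 29-1 and 29-2) would yield an exon of size zero.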
-
- II.B.2. Enter the boundaries of the 5 helices.
-
- Go to the "d=DISCRETE" elements submenu, and choose "e=ENTER". Enter the
- gene size as 103 codons, and enter the left and right boundaries of helix1,
- using the numbers in the table above. That is, enter 2 and 14 for the left
- and right boundaries of helix1. Continue ("c=continue") entering elements
- until all five have been entered. Then choose "d=done". Choose "v=view"
- to view the array, which will be a string of 1's and 0's. If the secondary
- structure elements were entered correctly, the bottom of the display should
- show the following message:
-
- The average score per position is 0.500000.
-
- This is the average score for positions in the array. In this case, it
- happens (quite by chance!) that exactly half of the 308 possible intron
- positions (103 codons --> 309 bp --> 308 inter-nucleotide sites) are
- internal to structural elements. If the elements were entered correctly,
- save this array to a file named "cytsec1.arr".
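-
- The average of 0.5 can be checked with a small sketch. It assumes that a
- site counts as internal when the nucleotides on both sides of it belong to
- the same element, so that the site between helix3 and helix4 (70-0) scores
- 0. These assumptions reproduce the tutorial numbers but are not taken from
- the ABaCUS code, and the names are hypothetical.
-
```python
# Helix boundaries (inclusive residue numbers) from the tutorial table.
HELICES = [(2, 14), (49, 55), (60, 69), (70, 75), (87, 103)]


def binary_array(elements, n_codons):
    """Score 1 for inter-nucleotide sites inside an element, 0 otherwise."""
    n_sites = 3 * n_codons - 1
    scores = [0] * n_sites            # list index 0 holds site 1
    for left, right in elements:
        first = 3 * (left - 1) + 1    # site after the element's 1st nucleotide
        last = 3 * right - 1          # site before the element's last nucleotide
        for site in range(first, last + 1):
            scores[site - 1] = 1
    return scores


array = binary_array(HELICES, 103)
average = sum(array) / len(array)
print("The average score per position is %f." % average)
```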
-
- II.B.3. Convert the maximum array score.
-
- Now choose "c=convert" to convert the array to a new maximum score. Enter
- 9999 for the maximum score, and save the converted array to a file called
- "cytsecm.arr".
-
- The array created in step II.B.2 had a maximum score of 1, and could be
- used to give binary scores to introns: that is, 0 is assigned to intron
- positions between structural elements, and 1 is assigned to intron
- positions within structural elements. Converting the array to a high
- maximum score creates a graduated array in which each number in the array
- is the distance in bp to the nearest element-free region. Recall that the
- first helix began at residue 2. Therefore, the first three intron
- positions, 1-1, 1-2, and 2-0, fall in an inter-helix region, whereas the
- next introns, 2-1, 2-2, 3-0, etc are successively more deeply embedded in
- helix1. The first 65 numbers (representing the inter-nucleotide sites in
- the first 22 codons) should look like this:
-
- 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 19 18 17 16 15 14 13
- 12 11 10 9 8 7 6 5 4 3 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-
- The distance scores continue to increase until site 8-2 (19 nucleotides
- from the carboxy end of helix1), then they decrease as the carboxy end of
- helix1 is approached. The last site that can be considered "inside" helix1
- is 14-2 (1 nt from the carboxy end of helix1); the next site, 15-0, is
- "outside" helix1, and has a score of 0.
-
- Although there are some circumstances in which one wishes to limit the
- maximum score (e.g., to 9 or 15), one usually wants a completely graded
- array, and 9999 is sufficiently high to ensure that the maximum achievable
- score will be reached in any gene (unless it is > 19998 bp in length!).
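-
- The conversion to a graduated array can be sketched as follows. This is an
- illustration under stated assumptions, not the ABaCUS routine: each run of
- non-zero sites is rescored with the distance to the nearer end of the run,
- capped at the chosen maximum, and the end of the gene is treated like an
- element-free region. The function name is hypothetical.
-
```python
def convert(binary, max_score):
    """Rescore each run of non-zero sites by distance to the run's nearer end."""
    graded = [0] * len(binary)
    i = 0
    while i < len(binary):
        if binary[i] == 0:
            i += 1
            continue
        j = i
        while j < len(binary) and binary[j] != 0:
            j += 1                    # sites i .. j-1 form one element
        for k in range(i, j):
            graded[k] = min(k - i + 1, j - k, max_score)
        i = j
    return graded


# First 65 sites of the tutorial array: 3 outside, 38 inside helix1, 24 outside.
first_65 = [0] * 3 + [1] * 38 + [0] * 24
graded = convert(first_65, 9999)
print(*graded)
```
-
- Calling convert(first_65, 9) instead limits every score to 9, which is the
- kind of restricted maximum mentioned above.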
-
-
- II.B.4. Load the crystal structure of rice cytochrome C.
-
- Go to the "a=ATOMIC COORDINATES" submenu and choose "l=LOAD". Enter the
- name of the file, which is "pdb1ccrs.txt". After the file has been read, a
- warning message will appear, indicating that the numbering in the file was
- non-consecutive. This does not necessarily mean that the file has been read
- incorrectly-- for instance, the chicken TPI crystal structure (PDB file
- 1tim) has no residue #3 (the numbering in the file is 1, 2, 4, 5, 6 . . .
- 246, 247, 248, but there are really only 247 residues). In the case of
- "pdb1ccrs.txt", the first 8 residues of pdb1ccr.ent were deleted, and the
- 103 residues in "pdb1ccrs.txt" are thus numbered 9-111 instead of 1-103.
- The atomic coordinates maintained in memory by ABaCUS have the correct
- numbers, 1-103, because ABaCUS assigns its own, consecutive, numbering
- system as it reads the file.
-
- Now quit ABaCUS, and find the file "calpha.xyz". This file, which was
- written automatically by ABaCUS when "pdb1ccrs.txt" was read, contains only
- the C-alpha coordinate lines from the original file, and thus the file is
- 10-50 times smaller than the original. Change the name of the file from
- "calpha.xyz" to "1ccrsca.xyz". Since we know that the original file has
- been read correctly, we can use "1ccrsca.xyz" in place of it, to save
- space. Every time ABaCUS loads a crystal structure, it creates a file
- called "calpha.xyz" with the C-alpha data. This file can be used to check
- whether the crystal structure has been read correctly and, if so, it can be
- used in place of the original PDB file.
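-
- The filtering and renumbering described above can be sketched as follows.
- This is not the ABaCUS reader: it assumes the standard fixed-column layout
- of PDB ATOM records (atom name in columns 13-16, residue number in columns
- 23-26, coordinates in columns 31-54), and it ignores alternate locations
- and multiple chains. All names are hypothetical, and the sample records are
- synthetic.
-
```python
def atom_line(serial, name, res_name, res_seq, x, y, z):
    """Build a synthetic fixed-column ATOM record (helper for the demo only)."""
    return ("ATOM  %5d %-4s %3s A%4d    %8.3f%8.3f%8.3f"
            % (serial, name, res_name, res_seq, x, y, z))


def read_calpha(lines):
    """Keep only C-alpha ATOM records and renumber residues from 1."""
    residues = []
    for line in lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            original = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            residues.append((len(residues) + 1, original, xyz))
    return residues


sample = [
    atom_line(1, " N  ", "ALA", 9, 1.0, 2.0, 3.0),
    atom_line(2, " CA ", "ALA", 9, 1.5, 2.5, 3.5),
    atom_line(3, " CA ", "SER", 10, 4.0, 5.0, 6.0),
    atom_line(4, " CA ", "PHE", 11, 7.0, 8.0, 9.0),
]
residues = read_calpha(sample)
# True when the file's own numbering differs from the consecutive numbering.
renumbered = any(new != old for new, old, _ in residues)
print(len(residues), "C-alpha records; renumbered:", renumbered)
```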
-
- II.C. STAGE 3: ANALYZING CORRESPONDENCES WITH THE CYTOCHROME C DATA SET
-
- Below are instructions for testing 3 hypotheses about the cytochrome C
- data. Each hypothesis involves a choice of a scoring rule and a reference
- gene model. The general form of each hypothesis will be that the observed
- gene data do not correspond (as quantified using the chosen scoring rule)
- to protein structure better than random introns or exons (generated by the
- reference model).
-
- II.C.1. Load the observed data files. Restart ABaCUS, and load the exon
- data file, "cythyp.exn", using the "l=LOAD" command in the main menu; load
- the array "cytsecm.arr" using the equivalent command in the "d=DISCRETE"
- submenu; and load the atomic coordinates in "1ccrsca.xyz" (or in
- "pdb1ccrs.txt", if you prefer to use the original file) using "l=LOAD" in
- the "a=ATOMIC COORDINATES" submenu. We'll load the intron position data
- later.
-
- II.C.2. Generate reference genes.
-
- II.C.2.a. Generate reference intron positions.
-
- Go to the "r=REFERENCE" genes submenu and choose "u=UNIFORM" intron
- positions. Unless you loaded the intron position file in step 1 above,
- ABaCUS gives an error message to the effect that reference intron positions
- cannot be generated unless observed intron positions have been loaded. The
- random reference gene data must reflect the properties of the observed gene
- data-- same number of introns, same gene length-- and therefore ABaCUS
- requires observed intron positions before it will generate reference intron
- positions. Exons are treated separately, but they follow the same rules.
-
- Go back to the main menu and load the intron position data from
- "cytobs.int" then return to the "r=REFERENCE" genes submenu. Using
- "u=UNIFORM", generate 1000 sets of uniform random intron positions, with
- the minimum inter-intronic distance set to 1 bp. Specifically, this model
- of random intron positions creates 1000 sets, each with 6 non-identical
- positions randomly drawn with uniform probabilities per inter-nucleotide
- site. This is the reference model for randomly placed introns.
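-
- A minimal sketch of this reference model is given below. It is not the
- ABaCUS generator (see section V.B for the random number generator actually
- used): it draws distinct sites uniformly and rejects any set that violates
- the minimum inter-intronic distance. The function name and the seed are
- hypothetical.
-
```python
import random


def uniform_introns(n_introns, n_codons, min_distance=1, rng=random):
    """Draw one sorted set of distinct, uniformly placed intron positions."""
    n_sites = 3 * n_codons - 1
    if (n_introns - 1) * min_distance + 1 > n_sites:
        raise ValueError("too many introns for this gene and minimum distance")
    while True:
        positions = sorted(rng.sample(range(1, n_sites + 1), n_introns))
        gaps = (b - a for a, b in zip(positions, positions[1:]))
        if all(gap >= min_distance for gap in gaps):
            return positions


rng = random.Random(1994)  # arbitrary seed, for repeatability only
reference = [uniform_introns(6, 103, 1, rng) for _ in range(1000)]
print(len(reference), "reference sets; first set:", reference[0])
```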
-
- II.C.2.b. Generate reference exon sizes.
-
- Go to the "r=REFERENCE" submenu and choose "p=PERMUTE" exon sizes. Ask for
- 2 sets of permuted exon sizes. Go back to the main menu and view the exon
- sizes on the console. It is easy to see that each set of exons contains
- exactly the same sizes as the other sets-- the only difference is in the
- order. Now generate another two sets and view them by returning to the
- main menu and choosing "v=VIEW". Notice that there are not 4 reference
- sets, but only 2. This is because ABaCUS erases the previous list of 2
- sets and replaces it with the new list of 2 sets. ABaCUS can only keep ONE
- list of random exons in memory, and the list is erased and rewritten every
- time reference genes are generated. Intron positions are stored
- separately, but they follow the same rule.
-
- Go back to the reference genes submenu and generate 1000 sets of randomly
- permuted reference exon sizes. This is the reference model for exon sizes.
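-
- The permutation model can be sketched as below. This is an illustration
- rather than the ABaCUS code, and the function name and seed are
- hypothetical. Each reference set holds the observed exon sizes in a random
- order.
-
```python
import random


def permuted_exon_sets(observed, n_sets, rng=random):
    """Return n_sets random re-orderings of the observed exon sizes."""
    sets = []
    for _ in range(n_sets):
        shuffled = list(observed)   # copy, so the observed order is preserved
        rng.shuffle(shuffled)
        sets.append(shuffled)
    return sets


observed_exons = [11, 18, 27, 8, 9, 23, 7]
reference_exons = permuted_exon_sets(observed_exons, 1000, random.Random(1994))
print(reference_exons[0], reference_exons[1])
```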
-
- II.C.3. Assign scores and evaluate the reference hypothesis.
-
- II.C.3.a. Assign scores and evaluate centrality of intron positions.
-
- This hypothesis, which we could call HC, for hypothesis regarding
- centrality, is that the intron positions do not correspond to central
- locations in the three-dimensional crystal structure better than randomly
- placed introns. The alternative is that introns tend to correspond to
- positions at the center of the protein. To carry out this test, we need
- observed intron positions, randomly placed intron positions, a crystal
- structure, and a method of measuring centrality. The first three things
- are already taken care of. Now all we need to do is measure the centrality
- of the observed and random intron positions, compare them, and draw a
- conclusion.
-
- Go to the "a=ATOMIC COORDINATES" menu and choose "c=CENTRALITY". Indicate
- that cytochrome C has only a single globular domain, and choose rule #4
- (this is the most logical rule for centrality; the other rules are not
- generally useful). ABaCUS will assign centrality scores to all sets of
- observed and reference introns, using the crystal structure in memory. Now
- return to the main menu, choose "t=TEST" and examine the results (add a
- comment at the prompt, if desired). Can the reference hypothesis, HC, be
- excluded?
-
- II.C.3.b. Assign scores and evaluate avoidance of secondary structures.
-
- The second hypothesis, HAS, is that the intron positions do not tend to
- avoid secondary structural elements better than randomly placed intron
- positions. The alternative is that intron positions tend to fall between
- secondary structures, or at least very close to their ends. The observed
- and random intron positions have already been generated (they are still in
- memory from the previous test). The scoring rule to be used in this test
- consists of the scores in the array "cytsecm.arr".
-
- Go to the "d=DISCRETE ELEMENTS" submenu, and choose "a=ASSIGN" to assign
- scores to the intron positions using the scoring array in memory. Since
- the array created earlier holds the distance in bp from each potential
- intron position to the nearest inter-element boundary, this is the score
- that the introns will receive. Return to the main menu to finish the test
- by choosing "t=TEST".
-
- At this point, take a break to notice several things about scoring rules.
- First, notice that the score assigned to a gene is the average of the
- constituent exon (or intron) scores. This is true for all of the scoring
- rules used by ABaCUS. Second, in all of the scoring rules used by ABaCUS,
- a lower score indicates a better correspondence. For centrality, a low
- score means greater proximity to the center (the center of mass, to be
- exact) of the protein; for avoidance of secondary structure, a low score
- means that the distance to the nearest inter-element region is small-- the
- introns are within, or close to, inter-element regions.
-
- Also, notice some things about the ABaCUS environment. The same list of
- 1000 sets of reference intron positions was used in two different tests.
- This is perfectly valid, and is actually preferable to generating separate
- sets for each test. The sets of intron positions stayed in memory, but the
- scores changed when a new scoring rule was chosen.
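-
- The comparison of observed and reference scores can be sketched as
- follows. This is a generic Monte Carlo comparison under the two rules just
- stated (gene score = average of element scores; lower is better). It is
- not a description of what "t=TEST" reports, which is covered in section
- IV.G. The names and the toy scores are hypothetical.
-
```python
def gene_score(element_scores):
    """Score of one gene: the average of its intron (or exon) scores."""
    return sum(element_scores) / len(element_scores)


def fraction_at_least_as_good(observed, reference):
    """Fraction of reference genes scoring as low as, or lower than, observed."""
    hits = sum(1 for score in reference if score <= observed)
    return hits / len(reference)


# Toy data: one observed gene and four reference genes, four introns each.
observed = gene_score([0, 3, 0, 1])
reference = [gene_score(s) for s in
             ([0, 0, 0, 0], [4, 4, 4, 4], [2, 2, 2, 2], [1, 1, 1, 1])]
fraction = fraction_at_least_as_good(observed, reference)
print(observed, fraction)  # prints: 1.0 0.5
```
-
- A small fraction suggests that random genes rarely match the observed
- correspondence; a large one gives no grounds for rejecting the reference
- hypothesis.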
-
- II.C.3.c. Assign scores and evaluate the extensity of exon-encoded
- peptides.
-
- The third hypothesis, HE, is that the peptides encoded by exons are no less
- extended than those encoded by random exons. The alternative is that
- exon-encoded peptides tend to be non-extended or compact. The observed
- data are already loaded, and the reference model (in this case, random
- permutations of the observed order of exon sizes) has already been chosen.
- It remains to choose a scoring rule, assign scores to the observed and
- reference exons, and evaluate the hypothesis.
-
- Go to the "a=ATOMIC COORDINATES" submenu and choose "e=EXTENSITY" scores.
- Assign scores to the exons using rule "r=radius of gyration" (this is, in
- our opinion, the most sensible rule for extensity: the other rules are
- explained in section IV.F). Now return to the main menu and choose "t=TEST"
- to evaluate the hypothesis.
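-
- A radius-of-gyration score for exon-encoded peptides can be sketched as
- below. The details are assumptions, not the ABaCUS rule: only C-alpha
- coordinates are used, every atom has equal weight, and exons are taken as
- consecutive runs of residues. The names and toy coordinates are
- hypothetical.
-
```python
import math


def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid."""
    n = len(coords)
    cx = sum(x for x, _, _ in coords) / n
    cy = sum(y for _, y, _ in coords) / n
    cz = sum(z for _, _, z in coords) / n
    total = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                for x, y, z in coords)
    return math.sqrt(total / n)


def exon_extensity(calpha, exon_sizes):
    """One score per exon, using consecutive blocks of C-alpha coordinates."""
    scores, start = [], 0
    for size in exon_sizes:
        scores.append(radius_of_gyration(calpha[start:start + size]))
        start += size
    return scores


# Toy chain of four residues on a straight line, split into two exons.
chain = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
scores = exon_extensity(chain, [2, 2])
print(scores, sum(scores) / len(scores))
```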
-
- Before quitting, take a moment to see how ABaCUS maintains records on past
- and current experiments. This information is accessed using the "i=INFO"
- command in the main menu. Choose this command, then choose "p=past" to see
- the results of the three experiments that have been performed. Now choose
- "i=INFO" again and choose "c=current" to see descriptions of the data that
- are now in memory. In general, the "i=INFO" functions are useful for
- keeping track of what has and has not been done during a session.
-
- When "q=QUIT" is chosen from the main menu, you will be prompted for the name
- of a file in which to save the results of the experiments performed. Name
- the file "tutorial.sum". The file will contain the information on past
- experiments that we viewed above.
-
- ===> This is the end of the tutorial. Section III provides generalized
- instructions for each of the steps done in the tutorial, and Sections IV
- and V provide details.
-
- ==========================================================================
- III. GENERAL STEPWISE INSTRUCTIONS
- ==========================================================================
-
- III.A. STAGE 1: COLLATE THE DATA PRIOR TO USING ABACUS
-
- Most of the effort in analyzing gene-protein correspondences will be spent
- preparing an observed case for analysis. Plan to devote a large amount of
- time to carrying out the following tasks: searching sequence databases to
- find known intron-containing sequences, checking the primary research
- literature to be sure that intron positions are correctly assigned, and
- aligning sequences with each other, as well as with protein structural
- elements. The following sequence of steps is recommended:
-
- III.A.1. Choose a protein for which intron-containing genes have been
- sequenced, and for which a crystal structure is known.
-
- III.A.2. Obtain a file containing atomic coordinates of the protein from
- the PDB. If there are several homologous structures to choose from, pick
- the one that is the best characterized (best refinement, most additional
- information on structural features).
-
- III.A.3. Make a list of boundaries of secondary structures and other
- structural elements. For example, PDB files often include a list of the
- boundaries of secondary structural elements.
-
- III.A.4. Search sequence databases to find all the known intron-containing
- genes. Align the inferred amino acid sequences with each other and with
- the protein whose structure has been determined.
-
- III.A.5. Make a list of all known intron positions in codon-phase notation
- relative to the protein whose structure has been determined. That is, for
- each intron, write down the corresponding residue number in the protein
- (each codon corresponds to a residue in the reference protein) and its
- phase (0, 1 or 2). An intron between codons 59 and 60 is 60-0 (codon 60,
- phase 0) in the notation of Dibb & Newman (1989).
-
- III.A.6. If an analysis of extensity is to be done, make a list of
- inferred ancestral intron positions. This list will be the same as the
- list of observed intron positions unless there are intron positions that
- are not separated by the first nucleotide of any codon (e.g., 29-1 and
- 29-2, or 29-1 and 30-0), or unless an "intron sliding" assumption is made
- on the basis of some looser criterion (for example, see Gilbert & Glynias,
- 1994). For further explanation, read the entire section IV.A., entitled
- INTRONS, EXONS AND INFERRED ANCESTRAL EXONS.
-
- Before starting ABaCUS, double-check that all positional data are numbered
- according to the same codon/residue numbering scheme, based on a multiple
- sequence alignment. For example, suppose that I am using the atomic
- coordinates and secondary structure boundaries for bovine dibibliomuctase.
- If the 199th codon of the rat dibibliomuctase gene has an intron in phase
- 1, and if the multiple sequence alignment shows that the encoded residue is
- homologous to the 193rd residue of the bovine sequence, then that intron
- should be designated as position 193-1, not 199-1. If the bovine protein
- has a beta-strand at 185-191 and an alpha-helix at 195-211, then the
- incorrect intron assignment would place the intron in the middle of the
- alpha-helix, instead of where it belongs, between the beta strand and
- alpha-helix. Check and double check the data (see section IV.C. BE CAREFUL
- WHEN ENTERING DATA). Obtaining a low-quality result by doing a
- sophisticated analysis on low-quality data is called "garbage in, garbage
- out."
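-
- The renumbering step can be sketched as follows. ABaCUS does not do this
- for you; the sketch assumes a pairwise alignment given as two gapped
- strings of equal length, with "-" as the gap character. The function name
- and the toy sequences are hypothetical.
-
```python
def map_residue(query_aligned, reference_aligned, query_residue):
    """Reference residue number aligned with a query residue, or None at a gap."""
    if len(query_aligned) != len(reference_aligned):
        raise ValueError("aligned sequences must have equal length")
    q = r = 0
    for q_char, r_char in zip(query_aligned, reference_aligned):
        if q_char != "-":
            q += 1
        if r_char != "-":
            r += 1
        if q_char != "-" and q == query_residue:
            return r if r_char != "-" else None
    raise ValueError("query residue lies beyond the end of the alignment")


# Toy alignment: the reference lacks query residues 2 and 3.
query = "MKTAYIA"
reference = "M--AYIA"
print(map_residue(query, reference, 4))  # prints: 2
```
-
- An intron in phase 1 of query codon 4 would therefore be entered as 2-1. A
- result of None means the codon has no counterpart in the reference protein
- and the intron position needs individual attention.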
-
-
- III.B. STAGE 2: ENTER THE OBSERVED DATA AND SAVE THEM TO FILES
-
- NOTE: Before you start ABaCUS, make sure that the relevant crystal
- structure file (if necessary) and the executable file or files (in DOS,
- look for "abacus.exe" and either "egavga.bgi" or another appropriate BGI
- graphics driver) are all in the same directory. Also, have ready the lists
- of intron positions and structural boundaries. To launch ABaCUS, type
- "abacus".
-
- NOTE: The files created in this step should be kept in the same directory
- as abacus.exe. They can then be read back at any time. It's a good idea to
- keep a list of the file names and a description of what each file contains,
- unless you are running ABaCUS within a console (e.g., DOS in Windows) and
- can examine files from the shell without quitting ABaCUS.
-
- III.B.1. Enter the observed intron positions, then save the intron
- positions to a file with the ".int" extension. Enter the inferred
- ancestral intron positions, then save the resulting inferred ancestral
- *exon sizes* to a file with the ".exn" extension. To find out more about
- intron positions and exon sizes, and why they are treated separately, see
- section IV.A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS.
-
- III.B.2. Enter the boundaries of structural elements, then save them to a
- file with the ".arr" extension. If desired, convert the maximum penalty in
- the scoring array to a different value, then save the converted array with
- a different name. Repeat this process for each different type of
- structural element that is being considered. For more information, see
- section IV.B. on arrays.
-
- III.B.3. Attempt to load the crystal structure file. If there is no
- apparent problem, check the crystal structure by viewing its diagonal plot
- (if you have the DOS graphics version), or by comparing the cryptic output
- file "calpha.xyz" (which contains only the CA lines) with the original
- file. If any discrepancy is noted, see section IV.D. on loading atomic
- coordinates from a pdb file, and correct any problems before continuing.
-
-
- III.C. STAGE 3: EVALUATE CORRESPONDENCES
-
- The intron-based analyses in III.C.1 and III.C.2 below should ideally be
- done together (in either order), since the same set of reference intron
- positions can then be used for both analyses (this is what was done in the
- tutorial exercise). The exon-based analysis, section III.C.3, can be done
- before or after the intron-based analyses.
-
- III.C.1. Evaluate intron positions with respect to structural elements:
-
- a. load the observed set of intron positions;
- b. generate reference sets of uniform or PIID introns;
- c. load the structural element scoring array;
- d. score the introns using the scoring array;
- e. evaluate the scores.
-
- Repeat steps c-e as required for other types of structural
- elements (there is no need to generate a new set of reference intron
- positions for each analysis).
-
- III.C.2. Evaluate intron positions with respect to centrality. You will
- be prompted in step (d) to answer whether the protein has multiple globular
- domains and, if the answer is "yes", you will be prompted to supply the
- number and boundaries of the globular domains. Steps (a) and (b) will not
- be necessary if they have already been performed:
-
- a. load the observed set of intron positions;
- b. generate reference sets of uniform or PIID introns;
- c. load the crystal structure;
- d. score the introns by centrality;
- e. evaluate the scores.
-
- III.C.3. Evaluate the extensity of exon-encoded peptides.
-
- a. load the inferred ancestral exon sizes;
- b. generate reference sets of lognormal or permuted exon sizes;
- c. load the crystal structure (if not already loaded);
- d. score the exons by extensity of exon-encoded peptides;
- e. evaluate the scores.
-
- III.C.4. Save the results. If any experiments have been performed,
- choosing "q = quit" will give you the option of saving a numbered list of
- experiment summaries to disk. Name the file using the ".sum" extension.
-
-
- ==========================================================================
- IV. DETAILED COMMENTS
- ==========================================================================
-
- IV.A. INTRONS, EXONS AND INFERRED ANCESTRAL EXONS
-
- IV.A.1. How intron positions are handled.
-
- Intron positions are entered by the user in codon-phase notation (Dibb and
- Newman, 1989) and are then transformed to a scale of nucleotides, such that
- the intron is given the number of the gene nucleotide that precedes it. The
- formula is thus:
-
- position = 3 * (codon - 1) + phase
-
- For example, if the gene is 146 codons long, then it has 438 nucleotides
- and 437 possible intron positions. Intron 68-0 (codon-phase) is at position
- 201 (bp scale). Thus, the intron positions used by ABaCUS exactly preserve
- the information entered by the user.
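-
- The conversion above can be sketched in a few lines of Python. This is
- only an illustration of the formula, not code from ABaCUS; the function
- names are hypothetical.
-
```python
def codon_phase_to_position(codon: int, phase: int) -> int:
    """Convert codon-phase notation to a nucleotide-scale position.

    The result is the number of the gene nucleotide that precedes the intron.
    """
    if codon < 1 or phase not in (0, 1, 2):
        raise ValueError("codon must be >= 1 and phase must be 0, 1 or 2")
    return 3 * (codon - 1) + phase


def position_to_codon_phase(position: int) -> tuple[int, int]:
    """Invert the conversion, recovering (codon, phase) from a position."""
    return position // 3 + 1, position % 3


print(codon_phase_to_position(68, 0))   # 201
print(position_to_codon_phase(201))     # (68, 0)
```
-
- Because the conversion can be inverted exactly, no information is lost in
- moving between the two notations.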
-
- IV.A.2. How exon sizes are handled.
-
- Exon sizes are only used in conjunction with a crystal structure for
- evaluating the extensity of exon-encoded peptides. By contrast to intron
- positions, exon sizes are always rounded to integral numbers of codons,
- such that a partial codon is assigned entirely to the 5' exon. Therefore,
- if the first intron in a gene is at position 38-0, the first exon will be
- 37 codons long, but if the first intron is at 38-1 (or 38-2 or 39-0), the
- first exon is considered to be 38 codons long.
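-
- As a rough sketch of this rounding rule (not the ABaCUS source; the
- function name and the gene length of 100 codons used in the example are
- assumptions made for illustration), exon sizes can be derived from
- codon-phase intron positions as follows. A configuration that would leave
- an exon with no codons is rejected.
-
```python
def exon_sizes(introns, n_codons):
    """Exon sizes in whole codons from (codon, phase) intron positions.

    A codon split by an intron (phase 1 or 2) is assigned to the 5' exon.
    """
    sizes = []
    previous_end = 0                      # last codon of the previous exon
    for codon, phase in introns:
        end = codon - 1 if phase == 0 else codon
        if end <= previous_end:
            raise ValueError(f"intron {codon}-{phase} yields an empty exon")
        sizes.append(end - previous_end)
        previous_end = end
    if n_codons <= previous_end:
        raise ValueError("the final exon would be empty")
    sizes.append(n_codons - previous_end)  # final exon
    return sizes


print(exon_sizes([(38, 0)], 100))   # [37, 63]
print(exon_sizes([(38, 1)], 100))   # [38, 62]
```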
-
- IV.A.3. Why exons and introns are handled differently.
-
- The reason that exon sizes are NOT in bp, but in codons, is that the
- exon-based scoring done by ABaCUS utilizes the atomic coordinates of
- alpha-carbons. Each exon must be found to correspond to a unique set of
- alpha-carbons, and thus no resolution is to be gained by expressing exon
- sizes in bp. Using integral numbers of codons also simplifies several
- procedures, especially the generation of lognormally distributed exon
- sizes.
-
- For the case of intron positions, using a nucleotide scale allows
- potentially useful resolution with regard to the boundaries of structural
- elements: for instance, if there is a helix encoded by codons 9 to 16, then
- there is a non-arbitrary (though possibly trivial) sense in which introns
- at 9-0 and 17-0 DO NOT interrupt the helix, whereas introns just at 9-1 and
- 16-2 DO interrupt the helix. By contrast, in deciding how exons correspond
- to sets of C-alpha carbons, we can only make an arbitrary choice about
- whether 9-0 and 9-1 both separate residue 8 from residue 9, or whether 9-1
- should be treated as though it separates residue 9 from residue 10.
-
- IV.A.4. Prohibited exon sizes.
-
- Any listing of intron positions is allowable, as long as the positions are
- entered consecutively and they do not fall outside the stated boundaries of
- the gene. However, some allowable configurations of intron positions
- cannot be converted by ABaCUS into exon sizes, since exon sizes must be
- whole numbers. For instance, if the user enters intron positions at 245-1
- and 246-0, the exon sizes will not be calculated correctly, since both of
- these introns would (by the rule described above in IV.A.2) separate
- residue 245 from 246. If the exon size cannot be resolved as a whole
- number, then the user must change the set of intron positions accordingly.
- In this case, the solution would be to combine the two intron positions,
- and enter the average of the two values. The evolutionary rationale for
- doing this is explained in the next two sections.
-
- IV.A.5. Inferred ancestral exon sizes.
-
- According to the exon theory of genes, introns are lost but not gained.
- Therefore, each intron position is thought to represent an intron that
- physically existed when the gene was first assembled billions of years ago.
- In addition, each intron position has a unique set of scores for any
- conceivable correspondence metric, and the scores are unaffected by other
- intron positions (i.e., the score for position X is 5, whether or not there
- is another intron at position Y). Consider some of the cytochrome C
- introns listed earlier:
-
- Arab., Aspergillus 65-0
- rice 74-0
- Aspergillus 96-2
-
- The same ultimate conclusion would result if we analyzed each intron
- separately, and then combined the data, or if we list all of the introns
- together, since the positions are still the same (65-0, 74-0 and 96-2).
-
- Exon sizes are not like this. For instance, the real cytochrome C gene of
- Aspergillus has an exon extending from the first nt of codon 65 to the
- second nucleotide of codon 96. According to the exon theory of genes, the
- introns flanking this exon must have existed in the ancestral gene, but the
- exon did not necessarily exist in the ancestral gene. Instead, because an
- intron is found in rice at position 74-0, the observed exon from 65 to 96
- in Aspergillus would NOT have been in the ancestral gene (according to the
- exon theory of genes), but would have been divided by an intron at position
- 74-0. By combining the intron positions from various genes, we infer a
- hypothetical set of ancestral exon sizes. In this case, there are no real
- exons to correspond to any of the inferred ancestral exons (for instance,
- the gene from rice has an exon extending from the first nucleotide of codon
- 74 to the end of the gene, but in the inferred ancestral gene this would be
- broken by the intron at position 96-2). This is why these exon sizes are
- referred to as *hypothetical* or *inferred* ancestral exon sizes.
-
- IV.A.6. Intron "sliding".
-
- Advocates of the exon theory of genes maintain that intron positions within
- a few codons of each other must represent the same ancestral position that
- has migrated, or "slid", to different positions in descendent genes.
- Suppose that we find an intron position in cytochrome C at position 75-2,
- just 5 nt away from the intron position at 74-0 in rice cytochrome C.
- According to the exon theory of genes, the ancestral gene did NOT contain
- an exon extending from 74-0 to 75-2, and including an exon of this size in
- an analysis would therefore not be consistent with the assumptions of the
- exon theory of genes. Instead, the ancestral gene is posited to have had a
- single intron position represented by both of the extant positions at 74-0
- and 75-2.
-
- Invoking "sliding" creates two problems. First, how does one decide when
- introns are too close to have co-existed in the ancestral gene? Second,
- given a criterion for the first problem, how does one decide on the
- position of an ancestral intron that may have left descendants at
- non-identical positions? In passing we note that (based on our own
- preliminary analyses) intron positions probably do not exhibit non-random
- clustering patterns (for an intuitive look at this problem, see section
- IV.E.3.a on the reference model of uniform intron positions), therefore no
- criterion of closeness can be justified. Because of this, the whole issue
- of "sliding" is probably a non-issue based on a non-phenomenon: either
- "sliding" is so rampant that all clusters are dispersed to non-significant
- levels, or it is so rare that significant clusters of intron positions do
- not arise.
-
- Nevertheless, in order to test the exon theory of genes, one must proceed
- in a manner that is consistent with its assumptions (even if they cannot be
- justified on prior grounds), and this means invoking "sliding" to explain
- away any excess intron positions. The rigorous way to do this is to pick a
- precise rule and stick with it. Our rule is to consider all cases of intron
- positions within 3 codons of each other as cases of "sliding", and to
- estimate the position of the ancestral intron by taking the average of the
- extant intron positions.
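-
- One possible reading of this rule is sketched below. The details are
- assumptions, not a description of what ABaCUS does: "within 3 codons" is
- taken to mean a gap of at most 9 bp on the nucleotide scale, clusters are
- built by chaining neighboring positions, and the average is left
- unrounded. The function name is hypothetical.
-
```python
def merge_sliding(positions, window_bp=9):
    """Replace each cluster of nearby intron positions by its average.

    A position joins the current cluster when it lies within window_bp
    of the previous position (nucleotide scale).
    """
    clusters = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= window_bp:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return [sum(cluster) / len(cluster) for cluster in clusters]


# Positions 65-0, 74-0, 75-2 and 96-2 on the nucleotide scale.
print(merge_sliding([192, 219, 224, 287]))   # [192.0, 221.5, 287.0]
```
-
- A fractional average must still be converted to a codon-phase position
- before it is entered, and that rounding choice is left to the user.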
-
- Note well that the need to invoke sliding would only arise when performing
- tests directly on exon sizes, not on intron positions. Even if "sliding"
- occurs, it is a more conservative test of intron positions to include all
- of the observed data than to use an additional assumption to amalgamate
- some of the observed data into hypothetical ancestral data.
-
-
- IV.B. ARRAYS: CREATING, CONVERTING, SAVING AND LOADING
-
- IV.B.1. Creating an array.
-
- The scoring arrays used by ABaCUS are linear arrays of integer penalties
- associated with each possible intron position in a gene. The penalties are
- assigned based on protein structural elements defined by the user. For
- example, consider an imaginary protein of 20 amino acids. This protein
- would be encoded by a gene with 20 codons, or 60 nucleotides. Thus, there
- would be 59 inter-nucleotide positions at which an intron might be found.
- Suppose that the protein has two alpha helices, one encompassing residues
- 3-12 and the other residues 13-19. Entering these boundaries into ABaCUS
- will produce the following array of scores for each intron position:
-
- 00000011111111111111111111111111111011111111111111111111000
-
- The array can be used as it is to score correspondences. Imagine that
- there are introns at positions 5-0, 13-2 and 16-0. When scored by the above
- array, each of these introns would be assigned a score of 1 (introns in
- codons 1, 2 and 20 would receive a score of 0, as would introns at positions
- 3-0 and 13-0).
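-
- The construction of such an array can be sketched as follows. This is an
- illustration under the assumption that an element covering residues F to
- L penalizes positions F-1 through L-2 (codon-phase), which matches the
- example above; the function name is hypothetical and the code is not taken
- from ABaCUS.
-
```python
def binary_array(n_codons, elements):
    """Build a 0/1 penalty array with one entry per intron position.

    elements is a list of (first_residue, last_residue) pairs.
    Entry i of the result holds the penalty for position i + 1.
    """
    scores = [0] * (3 * n_codons - 1)
    for first, last in elements:
        if first > last:
            raise ValueError(f"element {first}-{last} is inverted")
        start = 3 * (first - 1) + 1       # position first-1 (codon-phase)
        stop = 3 * (last - 1) + 2         # position last-2 (codon-phase)
        for pos in range(start, stop + 1):
            scores[pos - 1] = 1
    return scores


array = binary_array(20, [(3, 12), (13, 19)])
print("".join(str(s) for s in array))
```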
-
- IV.B.2. Converting an array.
-
- For most purposes, the array will be converted using a different maximum
- penalty (i.e., greater than 1), which is done with the "c=convert" function
- in the array submenu. With a maximum score of 9, the array shown in IV.B.1
- would look like this:
-
- 00000012345678999999999999987654321012345678999987654321000
-
- Using this array to score introns would be equivalent to deciding that the
- score for an intron will be the distance to the nearest inter-element
- region, up to a maximum of 9 bp (3 codons).
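-
- A minimal sketch of such a conversion is given below. It assumes that the
- converted score is the distance in bp to the nearest zero-penalty
- position, capped at the maximum penalty, and that the ends of the gene
- count as inter-element regions. The function name is hypothetical and the
- code is not the ABaCUS routine.
-
```python
def convert_array(binary, max_penalty):
    """Convert a 0/1 array to capped distances from the nearest zero."""
    n = len(binary)
    dist = [0 if b == 0 else n + 1 for b in binary]
    for i in range(n):                    # distances measured from the left
        if dist[i]:
            left = dist[i - 1] if i > 0 else 0
            dist[i] = min(dist[i], left + 1)
    for i in range(n - 1, -1, -1):        # distances measured from the right
        if dist[i]:
            right = dist[i + 1] if i < n - 1 else 0
            dist[i] = min(dist[i], right + 1)
    return [min(d, max_penalty) for d in dist]


# The 20-codon example: helices at residues 3-12 and 13-19.
binary = [int(c) for c in "0" * 6 + "1" * 29 + "0" + "1" * 20 + "0" * 3]
print("".join(str(s) for s in convert_array(binary, 9)))
```
-
- With a maximum penalty of 1 the array is returned unchanged.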
-
- IV.B.3. Saving, viewing and loading arrays.
-
- Array files can be saved and loaded by ABaCUS. The view command displays
- the array currently in memory, and calculates the average score for the
- array. These procedures are simple and require no further explanation.
-
- IV.C. BE CAREFUL WHEN ENTERING DATA
-
- ABaCUS has some smart menu handling features, in that it usually does not
- carry out nonsense operations in response to menu choices by the user. For
- instance, ABaCUS will not allow an attempt to draw a diagonal plot unless a
- crystal structure resides in memory. Likewise, when responding to the
- "t=TEST" command, ABaCUS will give an error message if no set of gene data
- is ready to test; if one set of data is ready, ABaCUS will test that set;
- if both sets are ready, ABaCUS will prompt the user for a choice.
-
- However, ABaCUS does not trap nonsense when the user is entering data on
- intron positions and boundaries of structural elements, or when the user is
- supplying parameters. For instance, if the user enters intron positions in
- non-consecutive order, this will create nonsense in downstream events.
- Likewise, if the boundaries of structural elements entered by the user are
- inverted, this will create nonsense in downstream events.
-
- For these reasons, it is recommended that the user enter all data and save
- them to files well before attempting to perform an analysis. Immediately
- after entering data, view the data using the appropriate v=view function,
- check for obvious errors, then save the data to disk. Check the resulting
- file for errors before proceeding with an analysis. Carefully record the
- number of codons for a gene. Be sure that sets of intron positions, sets
- of exon sizes, arrays, and atomic coordinates all match exactly in length.
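-
- Since ABaCUS does not trap these errors, it can be worthwhile to check the
- data independently before entering them. The following sketch shows the
- kind of checks intended; the function names are hypothetical and the
- checks are not part of ABaCUS.
-
```python
def check_introns(positions, n_codons):
    """List problems in a set of nucleotide-scale intron positions."""
    problems = []
    last_position = 3 * n_codons - 1
    for i, pos in enumerate(positions):
        if not 1 <= pos <= last_position:
            problems.append(f"position {pos} is outside 1..{last_position}")
        if i > 0 and pos <= positions[i - 1]:
            problems.append(f"position {pos} does not follow {positions[i - 1]}")
    return problems


def check_elements(elements, n_codons):
    """List problems in (first_residue, last_residue) element boundaries."""
    problems = []
    for first, last in elements:
        if first > last:
            problems.append(f"element {first}-{last} is inverted")
        if first < 1 or last > n_codons:
            problems.append(f"element {first}-{last} is outside 1..{n_codons}")
    return problems


print(check_introns([38, 12, 45], 20))
print(check_elements([(12, 3), (13, 19)], 20))
```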
-
-
- IV.D. LOADING ATOMIC COORDINATES FROM A PDB FILE
-
- Some PDB files can be read directly by the program, but some of them have
- to be edited. Specifically, ABaCUS will choke in the following cases:
-
- a) for multi-subunit crystal structures, due to the optional "subunit"
- field, which contains a single letter ("A" or "B", for instance). Delete
- the data for one of the subunits, then remove the subunit designator from
- the remaining lines (i.e., use a text editor to search for " A " and
- replace it with " ").
-
- b) when the third and fourth fields run together due to long descriptors
- for alternative side chain conformations. The solution to this uncommon
- problem is to separate the fields by inserting spaces.
-
- The file reader only extracts data from the "CA" lines, for C-alpha
- carbons. If the crystal structure has been read incorrectly, this should
- be obvious in the distance plot. If necessary, troubleshoot the editing
- process by looking at the cryptic output file "calpha.xyz": this file
- (rewritten each time a crystal structure is entered) echoes the information
- from the crystal structure file that ABaCUS has read and successfully
- stored in memory. Note that for its internal use, ABaCUS renumbers the
- residues in the order they are read. The output file will retain the
- numbering in the original, even if it is non-consecutive. For a 10- to
- 50-fold savings in disk space, throw out the successfully read PDB file and
- replace it with ABaCUS's version of the file (be sure to rename it, to
- anything other than "calpha.xyz", or it will be overwritten by ABaCUS).
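-
- To inspect a PDB file before editing it, the CA lines can also be pulled
- out with a short script. The sketch below is not the ABaCUS file reader.
- It assumes the standard fixed-column PDB layout for ATOM records (atom
- name in columns 13-16, alternate location in 17, chain in 22, residue
- number in 23-26, coordinates in 31-54), and the function name is
- hypothetical. Reading by column rather than by whitespace is a deliberate
- choice, since it is unaffected by a subunit letter or by fields that run
- together.
-
```python
def read_calpha(lines, chain=None):
    """Collect (residue number, x, y, z) for C-alpha atoms in PDB lines."""
    atoms = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue                       # skips HETATM, e.g. calcium ions
        if line[12:16].strip() != "CA":
            continue
        if chain is not None and line[21] != chain:
            continue
        if line[16] not in (" ", "A"):
            continue                       # keep one alternate conformation
        res_seq = int(line[22:26])
        x, y, z = (float(line[i:i + 8]) for i in (30, 38, 46))
        atoms.append((res_seq, x, y, z))
    return atoms
```
-
- For example, read_calpha(open("myfile.pdb"), chain="A") would return the
- C-alpha coordinates of subunit A only (the file name is a placeholder).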
-
-
- IV.E. GENERATING REFERENCE GENE DATA
-
- IV.E.1. Why "reference" gene data rather than "null" gene data.
-
- Speaking of a "null" hypothesis tends to imply that there is a single
- standard of nothingness or randomness against which the world can be judged
- to determine its somethingness or non-randomness. The words "null" and
- "random" tend to obscure the fact that a "null" or "random" model often
- involves complex assumptions, such as the complex reference models used by
- ABaCUS. Speaking of a reference model (instead of a "null" model) implies
- that we must be acutely concerned as to whether the form of the model and
- the parameters chosen are appropriate to serve as a reference for testing
- the sort of thing that we are interested in testing.
-
- IV.E.2. Logic of reference models.
-
- Reference models are used to generate sets of reference genes that have
- some of the properties of the observed data (e.g., same distribution of
- exon sizes). For the case of ABaCUS, the most important aspect of the
- reference algorithms is that they do not employ information on the protein
- structure. That is, imagine that I launch ABaCUS, input intron positions
- for my favorite gene, and then generate reference intron positions by one
- of several models. Since I haven't entered any other data, ABaCUS knows
- nothing about the protein structure, and therefore I can rest assured that
- the introns will be placed randomly with regard to the structure of the
- protein.
-
- If the reference model accurately reflects the important properties of the
- observed intron data, then the resulting reference hypothesis has the
- following form:
-
- THE OBSERVED SET OF INTRON POSITIONS (or exon sizes) DOES NOT CORRESPOND TO
- PROTEIN STRUCTURE BETTER THAN IS EXPECTED AT RANDOM, GIVEN THE PROPERTIES
- OF THE OBSERVED POSITIONAL DISTRIBUTION OF INTRONS (observed size
- distribution of exons)
-
- IV.E.3. Implementation of Reference models.
-
- Once an observed set of introns is in memory, reference sets of intron
- positions can be generated; once a set of observed (or inferred ancestral)
- exons is in memory, reference sets of exons can be generated. The
- "r=REFERENCE" genes submenu calls five generators. Each reference gene
- generator can create a user-specified number of reference genes, each of
- which is a set of either J intron positions (bp) or K exon sizes (in
- codons), where J and K are the numbers of observed introns and hypothetical
- exons currently in memory, respectively. The user may specify hundreds or
- thousands of sets of reference exons or introns at a time (see IV.E.4
- regarding the number of sets to choose). Output from the reference gene
- generators may be saved by the user, as described in section V.C.5.
-
- The five reference models are described in IV.E.3.a to IV.E.3.e below.
-
- IV.E.3.a. Uniform random introns.
-
- This function creates sets of uniformly distributed introns. The minimum
- distance between introns in a set is 1 bp (i.e., no position is chosen
- twice in a single set), unless the user specifies a higher number. The
- option to change the minimum distance is useful for gaining an intuitive
- sense for the random likelihood of closely-spaced introns. Some authors
- have claimed that introns within a few bp of each other must have arisen by
- some special process of intron "sliding", but this is not true. The screen
- display, which shows the number of attempts needed to complete each set,
- reveals how very often a randomly distributed intron falls 0, 1, 2, 3, etc.
- positions away from a previously existing intron.
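-
- The idea can be sketched as a simple rejection procedure. The code below
- is an illustration rather than the ABaCUS generator: it assumes that a
- candidate position is redrawn whenever it falls closer than the minimum
- distance to a position already chosen, and all names are hypothetical.
-
```python
import random


def uniform_introns(n_positions, n_introns, min_distance=1,
                    rng=None, max_attempts=100000):
    """Draw intron positions uniformly from 1..n_positions.

    Returns the sorted positions and the number of draws that were needed.
    """
    rng = rng or random.Random()
    chosen = []
    attempts = 0
    while len(chosen) < n_introns:
        attempts += 1
        if attempts > max_attempts:
            raise RuntimeError("could not place all introns")
        candidate = rng.randint(1, n_positions)
        if all(abs(candidate - c) >= min_distance for c in chosen):
            chosen.append(candidate)
    return sorted(chosen), attempts


positions, attempts = uniform_introns(437, 15, min_distance=9,
                                      rng=random.Random(1994))
print(positions, attempts)
```
-
- Comparing the number of attempts at different minimum distances gives the
- same intuition as the screen display described above.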
-
- IV.E.3.b. Introns by permuted inter-intronic distances.
-
- This function temporarily converts the observed set of intron positions
- into a set of inter-intronic distances, permutes these numbers randomly to
- generate random sets, then converts them back into intron positions. As
- with the function for permuting exons, large numbers of simulations should
- not be done from small numbers of intron positions (e.g., fewer than 10).
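-
- A sketch of this procedure follows. It assumes that the distances from
- the start of the gene to the first intron and from the last intron to the
- end of the gene are permuted along with the inter-intronic distances; the
- manual does not say how ABaCUS treats the ends. The function name and the
- gene length of 336 nucleotides in the example are hypothetical.
-
```python
import random


def permuted_introns(positions, gene_length, rng=None):
    """Shuffle the distances between introns and rebuild the positions."""
    rng = rng or random.Random()
    bounds = [0] + sorted(positions) + [gene_length]
    distances = [b - a for a, b in zip(bounds, bounds[1:])]
    rng.shuffle(distances)
    new_positions = []
    total = 0
    for distance in distances[:-1]:       # the last distance reaches the end
        total += distance
        new_positions.append(total)
    return new_positions


print(permuted_introns([192, 219, 287], 336, random.Random(7)))
```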
-
- IV.E.3.c. Lognormal exon sizes.
-
- This function creates random exons with the same lognormal mean and
- standard deviation as the observed set of exons in memory. Since most such
- sets of exons will not add up to the length of the observed gene, and since
- this condition is necessary, most sets of lognormal exons are discarded (as
- will be apparent from the display shown by this function). Imposing this
- condition might (one would suspect) distort the resulting distribution from
- its intended form, but no significant deviations are detectable in
- statistical tests.
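-
- The discard-and-retry logic can be sketched as follows. Several details
- are assumptions rather than documented behavior: the lognormal parameters
- are taken as the mean and sample standard deviation of the log exon
- sizes, sizes are rounded to whole codons, and no exon is allowed to be
- shorter than one codon. The function name is hypothetical.
-
```python
import math
import random
import statistics


def lognormal_exons(observed_sizes, rng=None, max_attempts=1000000):
    """Draw lognormal exon sizes until they sum to the observed gene length.

    Returns the accepted sizes and the number of sets that were tried.
    """
    rng = rng or random.Random()
    logs = [math.log(size) for size in observed_sizes]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    target = sum(observed_sizes)
    for attempt in range(1, max_attempts + 1):
        sizes = [max(1, round(rng.lognormvariate(mu, sigma)))
                 for _ in observed_sizes]
        if sum(sizes) == target:
            return sizes, attempt
    raise RuntimeError("no acceptable set of exon sizes was found")


sizes, tried = lognormal_exons([10, 20, 30, 40], random.Random(1994))
print(sizes, tried)
```
-
- The number of sets tried before one is accepted shows directly how many
- candidate sets are discarded.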
-
- IV.E.3.d. Permuted exon sizes.
-
- This function creates successive random permutations of the observed order
- of exon sizes ('successive' meaning that each permutation is generated from
- the previous one, rather than from a common parental order). This
- reference model is the one used by Gilbert and Glynias (1994). Large
- numbers of simulations (>>100) should not be done from small numbers of
- exon sizes (e.g., fewer than 10), or the generation of identical and
- nearly-identical orders of exon sizes in different replicate exon sets will
- reduce the expected statistical reliability of the final result. If the
- number of exon sizes is large, this is a good non-parametric reference
- model.
-
- IV.E.3.e. Exponential exon sizes.
-
- This function creates exponentially distributed exon sizes, with the option
- for low-end censoring. It is not recommended in most cases, since in most
- cases the observed distribution of inter-intronic distances will not be
- exponential. In particular, if "intron sliding" has been invoked (see
- section IV.A.6), an exponential distribution is invalid unless low-end
- censoring is applied to screen out any inter-intronic distances that would
- be prohibited in the observed set by the "sliding" rule (e.g., the use of
- an exponential distribution by Gilbert and Glynias, 1994, is invalid for
- this reason, among others). Even with censoring invoked, the distribution
- of inter-intronic distances is usually much more like a lognormal
- distribution than an exponential one, unless there are large numbers of
- intron positions known for the gene (e.g., as in the case of GAPDH).
-
- IV.E.4. Number of reference sets to generate.
-
- The number of reference sets to generate is based on the desired accuracy
- of the resulting P value, and is strictly limited by memory availability
- when running in the DOS environment.
-
- IV.E.4.a. Accuracy of the P value. The P value is expected to have
- binomial variance, i.e.,
-
- V = P * (1 - P) / (N - 1).
-
- Imagine that 100 simulations have been done and two correspondence rules
- have been tested. Since only two tests have been done, the 5% critical
- level is applicable. Suppose that one test gives a P value of P = 5/100 =
- 5%, the other of P = 20/100 = 20%. These P values carry uncertainty: their
- expected 95% confidence intervals are +/- 0.044 and +/- 0.080,
- respectively. One may be confident that the second result (P = 20%) is not
- significant (i.e., it is extremely unlikely that this P value is really <
- 0.05). However, how does one interpret the first P value? It could be
- less than 1% (very significant!) or more than 9% (not significant at all!).
- In such a case, one cannot make a reliable judgment about the status of the
- reference hypothesis, because the P value itself carries too much
- uncertainty. If 1000 simulations are performed instead, then the
- probability might be found to have a more exact value of 0.043 or 0.078 or
- 0.061 or 0.036. In each of these cases the P value would be precise enough
- that its relationship to the 5% critical level, either higher (0.061,
- 0.078) or lower (0.036, 0.043), can be judged with much more confidence.
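-
- The arithmetic behind these intervals can be checked with a few lines of
- code. The sketch below assumes that the quoted 95% intervals correspond to
- two standard deviations on either side of P, which reproduces the figures
- of 0.044 and 0.080 given above; the function names are hypothetical.
-
```python
import math


def p_value_variance(p, n):
    """Binomial variance of a P value estimated from n simulations."""
    return p * (1 - p) / (n - 1)


def half_width(p, n, n_sd=2.0):
    """Approximate half-width of the confidence interval around p."""
    return n_sd * math.sqrt(p_value_variance(p, n))


print(round(half_width(0.05, 100), 3))   # 0.044
print(round(half_width(0.20, 100), 3))   # 0.08
```
-
- Calling half_width with n = 1000 shows how much the interval narrows when
- more reference sets are generated.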
-
- IV.E.4.b. Memory limitations. Practical memory limitations are not an
- issue except in the DOS environment (especially 286-based machines). The
- startup screen displays how much of the DOS standard 640 K block is
- available for simulations, and makes an approximate calculation of the
- total number of simulations that can exist in memory (exon sets and intron
- sets combined) at any time. Regular users who wish to generate more than
- one thousand sets of reference genes with more than ca. 40 introns or exons
- per set should move to a non-DOS environment. DOS weenies can re-compile
- ABaCUS without the graphics and with kMaxNumValues set to 1 + X (where X is
- the maximum number of exons or introns needed per set) to maximize the
- number of simulations possible.
-
- IV.F. SCORING CORRESPONDENCES
-
- IV.F.1. Types of rules. There are three general models for evaluating
- correspondences:
-
- 1. Centrality of intron-associated residues.
- 2. Distance of intron positions to inter-element regions.
- 3. Extensity of exon-encoded peptides.
-
- The centrality and distance scores are assigned directly to intron
- positions, while the third type of score (extensity) is assigned directly
- to exons. Centrality scores and extensity scores are based on measurements
- of atomic coordinates-- thus they require a crystal structure. The
- distance scores are based on structural elements defined by the
- user.
-
- IV.F.2. Common features of correspondence rules.
-
- For ABaCUS, a "gene" is a set of exon sizes or intron positions. For all
- types of scoring rules, the score assigned to a gene is the average score
- for the intron positions or exon sizes in the gene. For all types of
- rules, a lower score indicates greater conformity to the expectations of
- Blake's conjecture (Blake, 1978) or the exon theory of genes as developed
- by Go, Gilbert, and others (see references in Stoltzfus, et al., 1994).
-
- IV.F.3. Centrality scores.
-
- Centrality scoring is done by choosing "c=centrality" from the "a=ATOMIC
- COORDINATES" submenu. Any observed or reference introns are scored using
- the crystal structure in memory and a user-designated choice of scoring
- rule. The lowest scores are achieved by centrally located
- introns/residues. The scoring schemes implemented for centrality scoring
- are:
-
- 1. intron score = percentage of pairwise distances > cutoff;
- 2. intron score = average of all pairwise distances;
- 3. intron score = maximum of all pairwise distances;
- 4. intron score = distance from center of mass of domain.
-
- The first rule is somewhat similar to the intuitive rule used by Go (1981)
- in proposing the boundaries of "modules" of hemoglobin. The second rule is
- similar to the rule implied by Figure 1 of Blake (1981). Stoltzfus, et al.
- (1994) use only rule #4, which we feel is the definitive rule for centrality.
- For multidomain proteins, you will be prompted to enter the domain
- boundaries when using this rule. Specifically, the center of mass of each
- domain is calculated, then introns are assigned a score equal to the
- distance in Angstroms from the residue associated with the intron to the
- center of mass of the domain in which it resides.
-
- To implement centrality scores, an arbitrary choice must be made about how
- to associate intron positions with residues in a crystal structure. For
- ABaCUS, the residue associated with an intron is defined as the residue
- encoded by a codon that is split by the intron, or that is bounded on its
- 5' end by the intron.
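-
- Rule #4 and the intron-to-residue association can be sketched together.
- The code below is an illustration, not the ABaCUS implementation: it
- assumes that the center of mass is the unweighted mean of the C-alpha
- coordinates of the domain, and the function names are hypothetical.
-
```python
import math


def centroid(coords):
    """Unweighted mean of a list of (x, y, z) coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def centrality_scores(positions, calpha, domains=None):
    """Distance from each intron-associated residue to its domain center.

    positions are nucleotide-scale intron positions, calpha holds the
    coordinates of residues 1..N in order, and domains is a list of
    (first_residue, last_residue) pairs (one domain if omitted).
    """
    domains = domains or [(1, len(calpha))]
    scores = []
    for pos in positions:
        residue = pos // 3 + 1            # codon split by, or 3' of, the intron
        first, last = next(d for d in domains if d[0] <= residue <= d[1])
        center = centroid(calpha[first - 1:last])
        scores.append(math.dist(calpha[residue - 1], center))
    return scores


calpha = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(centrality_scores([1, 3], calpha))   # [2.0, 0.0]
```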
-
- For information on centrality plots, see section V.C.4.
-
-
- IV.F.4. Distance scores.
-
- Correspondences with regard to defined structural elements are analyzed by
- using distance scores. The complete set of all possible distance scores is
- stored in an array. Any number of arrays may be created by the user, to
- represent secondary structures, domains, motifs, modules, etc. Introns
- from the observed set and any reference sets are scored by the distance
- scoring array currently in memory when this scoring option is chosen.
-
- There is a single option in the settings menu that affects the manner in
- which distance scores are calculated (see V.C.9).
-
- In essence, one uses distance scores to detect correspondences between
- points on a line and segments of the line. For instance, one may ask
- whether introns in a protein-coding gene fall between or within structural
- elements, or whether introns in structural RNAs fall between or within
- defined regions, such as base-paired regions or exposed regions. This type
- of scoring is readily adaptable to calculating the closeness or identity of
- one set of points on a line with another set of points (e.g., how closely
- does one set of introns match another set?).
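-
- Scoring a gene with an array then amounts to a table lookup followed by
- an average, as in the sketch below. The array is the converted 20-codon
- example from section IV.B.2; the function name is hypothetical and the
- code is not taken from ABaCUS.
-
```python
def score_gene(positions, array):
    """Return the mean score and the individual scores for a set of introns.

    positions are nucleotide-scale intron positions; entry i of array
    holds the penalty for position i + 1.
    """
    scores = [array[pos - 1] for pos in positions]
    return sum(scores) / len(scores), scores


converted = [int(c) for c in ("000000" + "12345678" + "9" * 13 + "87654321"
                              + "0" + "12345678" + "9999" + "87654321" + "000")]

# Introns 5-0, 13-2 and 16-0 lie at positions 12, 38 and 45.
print(score_gene([12, 38, 45], converted))
```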
-
- IV.F.5. Extensity scores.
-
- Scoring by the extensity of exon-encoded peptides is done using the
- "e=EXTENSITY" scores option of the atomic coordinates submenu. Five
- different scoring rules are implemented, some of which depend on a
- user-supplied arbitrary cutoff value in Angstroms:
-
- b (binary) score = 1 if any distance > cutoff; else score = 0;
- n (number) score = number of inter-C-alpha distances > cutoff;
- a (average) score = average inter-C-alpha distance;
- m (maximum) score = maximum inter-C-alpha distance;
- r (radius) score = radius of gyration.
-
- Each rule assigns scores to exons based on measurements on the atomic
- coordinates of the residues encoded by each exon, using the crystal
- structure in memory.
-
- The first three rules (the first two of which use a distance cutoff) are
- intended as precise versions of the inexact methods of Go (1981), Gilbert
- (1985, 1986) and others, in which arguments are made based on the
- appearance of diagonal plots with distance cutoffs in the range of 23-28
- Angstroms. The first two rules give somewhat erratic results. The second
- rule is equivalent in
- effect to the rule used by Gilbert and Glynias (1994; they assign to genes
- the sum, rather than the average, of exon scores, but this difference would
- not affect the final ranking of observed and reference scores).
-
- Stoltzfus, et al. (1994) concentrate on the "maximum" (a.k.a. "diameter")
- rule and the radius of gyration. The radius of gyration is a measure of
- 3-dimensional dispersion, defined simply as the root mean square distance
- of alpha carbons from the center of mass of the exon-encoded peptide.
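-
- The five rules can be sketched for a single exon-encoded peptide as
- follows. This is an illustration only: the center of mass is assumed to
- be the unweighted mean of the C-alpha coordinates, the peptide is assumed
- to contain at least two residues, and the function name is hypothetical.
-
```python
import math
from itertools import combinations


def extensity_scores(coords, cutoff):
    """Score one exon-encoded peptide from its C-alpha coordinates."""
    distances = [math.dist(a, b) for a, b in combinations(coords, 2)]
    n = len(coords)
    center = tuple(sum(c[i] for c in coords) / n for i in range(3))
    return {
        "b": 1 if any(d > cutoff for d in distances) else 0,
        "n": sum(d > cutoff for d in distances),
        "a": sum(distances) / len(distances),
        "m": max(distances),
        "r": math.sqrt(sum(math.dist(c, center) ** 2 for c in coords) / n),
    }


peptide = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
print(extensity_scores(peptide, cutoff=5.0))
```
-
- The score for a gene would then be the average of these values over its
- exons, as described in section IV.F.2.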
-
-
- IV.G. EVALUATING THE SIGNIFICANCE OF A CORRESPONDENCE
-
- After each scoring of introns or exons, the results may be evaluated. A
- set of introns (or a set of exons) in memory carries only a single set of
- scores at a time, from the most recent scoring. The command "t=TEST" will
- take the observed and reference scores in memory, calculate means and
- standard deviations, and rank the observed score within the reference
- scores. The mean of the standard deviation of exon scores within a
- reference set is calculated, as well as the standard deviation of the mean
- gene score.
-
- A P value is calculated as the proportion of reference sets that score AS
- LOW OR LOWER than the observed set. This P value represents the chance of
- obtaining a correspondence as good or better than the one observed, if the
- reference hypothesis is true. If the P value is less than 5% or 1%
- (depending on the number of tests performed), then the reference hypothesis
- may be false.
-
- If the scores of the reference sets are normally distributed, then the
- difference between the observed and reference means (expressed in standard
- deviations of the reference mean) should be related to the P value by the
- normal probability function (e.g., if P = 0.05, then the observed mean
- should be lower than the reference mean by about 1.64 standard deviations
- of the reference mean). Scores derived by the centrality and extensity
- rules are usually distributed roughly normally. However, distance scores
- assigned by arrays often have a skewed distribution, especially if a low
- maximum score has been used to convert the array.
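-
- The two quantities discussed above can be sketched as follows. The code
- is an illustration and not the ABaCUS routine; it assumes the sample
- standard deviation of the reference gene scores, and the function names
- are hypothetical.
-
```python
import statistics


def evaluate(observed, reference):
    """Return the P value and the standardized difference for one test.

    observed is the mean score of the observed gene and reference is the
    list of mean scores of the reference genes.
    """
    p_value = sum(r <= observed for r in reference) / len(reference)
    mean = statistics.mean(reference)
    sd = statistics.stdev(reference)
    return p_value, (observed - mean) / sd


def expected_difference(p_value):
    """Standardized difference expected for p_value under normality."""
    return statistics.NormalDist().inv_cdf(p_value)


print(evaluate(2, [1, 2, 3, 4, 5]))
print(round(expected_difference(0.05), 2))   # -1.64
```
-
- A large gap between the two standardized differences suggests that the
- reference scores are not normally distributed.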
-
- Note that every time the "t=TEST" command is successfully executed, a
- description of a numbered experiment is stored in memory. The experiment
- list in memory continues to grow with each new experiment, and it can be
- saved as explained below.
-
-
- IV.H. SAVING RESULTS; FURTHER ANALYSIS OF SCORES; etc
-
- Each time the "t=TEST" command is executed, an experimental test of a
- hypothesis has been performed. As a first approximation, each such test is
- equally valid and therefore, in order to be rigorous, the conclusions drawn
- from a set of tests should represent the results of all experiments, rather
- than just "the ones that turned out right." Failure to follow this
- methodological imperative tends to lead to errors in which one or a few
- "significant" results from a large set of equally valid tests are singled
- out for special attention. An example of this type of error can be found
- in Go and Nosaka (1987) in which a subset of all available intron positions
- is singled out for special comment because it shows a "significant"
- correspondence.
-
- In order to save the results of hypothesis-testing to disk, you must choose
- "quit" from the main menu, and supply a name for the file to contain all
- experiment summaries. The summary writer was designed to save most of the
- parameters necessary to replicate each experiment (it's good to take notes,
- though). Short user-supplied comments can be added to the experiment
- description in memory at the time the hypothesis is evaluated, and these
- comments will be written to disk when the experiment summary is saved.
-
- Under normal conditions, ABaCUS does not save detailed reference gene
- data-- it saves the mean, standard deviation and ranking of the observed
- sets relative to the reference score, and the rest is thrown away. This
- makes it impossible to analyze (for instance) the statistical distribution
- of reference gene scores, or to ask other interesting questions, such as
- "How low would an observed score have to be to rank in the lowest 5% or the
- lowest 1%?". However, questions such as these CAN be addressed if the user
- takes special steps to save the relevant data. There are three ways of
- doing this, each of which may be desirable under different circumstances,
- depending on the reason for saving the results:
-
- 1) If the reference introns or exons have been scored, the "save" function
- will include the scores when it writes the intron positions or exon sizes
- to disk. If the reference introns or exons have been scored and evaluated,
- the means and standard deviations will also be recorded. The resulting
- file can be large: a file with 1000 scored sets of reference genes, with 15
- introns in each set, takes up 200 K.
-
- 2) Settings can be changed to turn on a file writer that records the mean
- score for each reference set (only the mean for each set-- not the
- individual exon or intron scores). See section V.C.6.
-
- 3) The user may effectively "save" reference gene data by saving its
- initial conditions. See section V.C.10 for instructions on how to manually
- enter a random number seed that can be used at a later date to regenerate
- the same data.
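-
- Once the reference scores have been saved by one of these methods, the
- questions raised above can be answered outside ABaCUS. The following
- Python sketch assumes that the scores have already been read into a list
- of numbers (the layout of the saved files is not described here, so no
- file parsing is attempted). The function names, the use of the n - 1
- denominator for the standard deviation, and the ranking convention are
- all assumptions made for illustration.
-

```python
import math


def summarize_reference_scores(reference_scores, observed_score):
    """Return (mean, standard deviation, rank of the observed score)."""
    n = len(reference_scores)
    if n < 2:
        raise ValueError("need at least two reference scores")
    mean = sum(reference_scores) / n
    # Sample standard deviation (n - 1 denominator); an assumption here.
    variance = sum((s - mean) ** 2 for s in reference_scores) / (n - 1)
    # Rank 1 means the observed score is lower than every reference score.
    rank = 1 + sum(1 for s in reference_scores if s < observed_score)
    return mean, math.sqrt(variance), rank


def lower_tail_threshold(reference_scores, fraction):
    """Highest score that still falls within the lowest `fraction` of the
    reference scores, or None if there are too few scores to say."""
    ordered = sorted(reference_scores)
    k = int(fraction * len(ordered) + 1e-9)  # number of scores in the tail
    if k < 1:
        return None
    return ordered[k - 1]
```

-
- For example, with 1000 reference scores, lower_tail_threshold(scores,
- 0.05) returns the 50th-lowest score; an observed score at or below that
- value ranks in the lowest 5%.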
-
-
- IV.I. PLOTTING DIAGONAL PLOTS AND EXON PLOTS
-
- IV.I.1. Diagonal plots.
-
- A diagonal plot, or C-alpha-C-alpha distance map, is a 2-dimensional
- contour map of a 3-dimensional protein structure, based on the pairwise
- distances between alpha-carbons, plotted on cartesian coordinates. Many
- diagonal plots that appear in the literature show three contours: very
- short pairwise distances (e.g., < 12 Angstroms) in gray, very long pairwise
- distances (e.g., > 28 Angstroms) in black, and intermediate distances in
- white (e.g., Go, 1981).
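-
- The idea can be sketched in a few lines of Python. This is an
- illustration only, not the ABaCUS plotting code: it assumes the
- alpha-carbon coordinates are already available as (x, y, z) tuples in
- Angstroms, and it uses the example cutoffs of 12 and 28 Angstroms
- mentioned above. The function names are hypothetical.
-

```python
import math


def ca_distance_map(ca_coords):
    """Pairwise distances between alpha-carbons, as an R x R list of lists."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]


def three_contour_class(distance, short_cutoff=12.0, long_cutoff=28.0):
    """Classify one pairwise distance into the three contours."""
    if distance < short_cutoff:
        return "gray"    # very short distance
    if distance > long_cutoff:
        return "black"   # very long distance
    return "white"       # intermediate distance


if __name__ == "__main__":
    # Three made-up alpha-carbon positions along a straight line.
    coords = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (30.0, 0.0, 0.0)]
    for row in ca_distance_map(coords):
        print([three_contour_class(d) for d in row])
```

-
- The map is symmetric about the diagonal, and every point on the diagonal
- itself has distance zero.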
-
- IV.I.2. Exon plots.
-
- Exon plots are like diagonal plots, but they only show the distances between
- residues encoded by the same exon. The plot thus appears as a series of N
- right triangles with their hypotenuses along the diagonal, where N is the
- number of exons. It is possible to make exon plots of both the inferred
- ancestral set of exons, and reference sets of exons. Exon plots are
- sometimes useful for developing a nuts-and-bolts understanding of why
- different gene structures achieve different extensity scores.
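-
- The triangular layout follows directly from the rule "same exon only".
- The Python sketch below is a simplified illustration, not the ABaCUS
- code: it assumes that exon sizes are given in whole residues (introns
- falling within a codon are ignored) and that only the half of the plot
- on one side of the diagonal is drawn. The function names are
- hypothetical.
-

```python
def exon_index_per_residue(exon_sizes):
    """Label each residue with the index of the exon that encodes it."""
    labels = []
    for exon_index, size in enumerate(exon_sizes):
        labels.extend([exon_index] * size)
    return labels


def exon_plot_mask(exon_sizes):
    """mask[i][j] is True where the exon plot shows the distance between
    residues i and j: same exon, and j not beyond i (one triangle per exon)."""
    labels = exon_index_per_residue(exon_sizes)
    n = len(labels)
    return [[labels[i] == labels[j] and j <= i for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Two exons encoding 2 and 3 residues: two triangles along the diagonal.
    for row in exon_plot_mask([2, 3]):
        print("".join("#" if shown else "." for shown in row))
```

-
- Large exons produce large triangles, so the plot shows at a glance how
- many long-range distances each exon contains.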
-
- IV.I.3. Plotting options.
-
- ABaCUS is capable of making black & white distance plots (i.e., two
- contours), or color distance plots with 16 contours. For black & white
- plots, a single cutoff value distinguishes close and distant inter-residue
- distances. For color plots, there is a scalable relationship between the
- 16-color palette and the distance between residues. Also, color plots can
- depict all distances (choose cutoff = 0.0 to do this), or only those
- distances greater than an arbitrary cutoff value (e.g., 25 Angstroms). The
- settings menu explains how to alter settings to suit your interests.
-
-
- ==========================================================================
- V. ADDITIONAL DETAILS
- ==========================================================================
-
- V.A. HARD LIMITS ON PARAMETERS
-
- Limits are set differently depending on whether or not the program is
- compiled in DOS:
-
-     limit             DOS    non-DOS
-     ______________    ____    _______
-
-     kMaxNameLength      14         30
-
-     kMaxArraySize     2400       4000
-
-     kMaxNumValues       26        101
-
- The first column of values is used if Compiled_in_DOS is #defined as 1 in
- the header file "abacus.h"; the second column is used when Compiled_in_DOS
- is set to 0.
-
- The experienced user may wish to alter these limits. kMaxNameLength
- refers to the length of file names. kMaxArraySize refers to the scoring
- arrays used in distance scoring (the DOS limit of 2400 sites, or 800
- codons, should be sufficient for most purposes). kMaxNumValues is 1 + the
- maximum number of intron positions or exon sizes per gene that you wish to
- use.
-
- There is no hard limit on the number of residues in a crystal structure or
- on the length of the gene represented by a set of intron positions or exon
- sizes.
-
-
- V.B. THE RANDOM NUMBER GENERATOR
-
- The code for the uniform random number generator at the heart of ABaCUS's
- simulations is taken from p. 282 of _Numerical Recipes in C_ (Press, et
- al., 1992; and references therein). This is the "ran2" long-period (about
- 10^18) pseudo-random number generator, described by the authors as "the
- generator of L'Ecuyer with Bays-Durham shuffle and added safeguards". It
- returns a uniform random deviate between 0.0 and 1.0 (exclusive of the
- endpoint values).
-
- The routines for generating uniform intron positions and exponential exon
- sizes, and the routines for permuting exon sizes and inter-intronic
- distances rely directly on the uniform random number generator. The
- routine for generating lognormal exon sizes makes use of Box and Muller's
- general method of converting uniform random deviates into normal deviates.
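-
- The general idea behind these conversions can be sketched as follows.
- This Python illustration is not the ABaCUS code: it uses Python's
- built-in generator in place of "ran2", it uses the basic trigonometric
- form of the Box-Muller transform, and the parameter names (mean, mu,
- sigma) are assumptions about how such distributions are usually
- specified.
-

```python
import math
import random


def uniform_open(rng):
    """Uniform deviate strictly between 0.0 and 1.0 (endpoints excluded)."""
    u = rng.random()
    while u <= 0.0:          # random() can return 0.0 but never 1.0
        u = rng.random()
    return u


def exponential_deviate(rng, mean):
    """Exponential deviate obtained by inverting the cumulative distribution."""
    return -mean * math.log(uniform_open(rng))


def normal_deviate(rng):
    """Standard normal deviate from two uniform deviates (Box-Muller)."""
    u1 = uniform_open(rng)
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)


def lognormal_deviate(rng, mu, sigma):
    """Lognormal deviate: the exponential of a normal deviate."""
    return math.exp(mu + sigma * normal_deviate(rng))


if __name__ == "__main__":
    rng = random.Random(1994)
    print([round(exponential_deviate(rng, 50.0), 1) for _ in range(5)])
    print([round(lognormal_deviate(rng, 4.0, 0.5), 1) for _ in range(5)])
```

-
- Every deviate is built from one or two calls to the uniform generator,
- which is why the quality of that generator matters for all of the
- reference models.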
-
- V.C. EXPLANATION OF THE SETTINGS MENU
-
- NOTE: The defaults for these settings are hard-coded. Any changes made to
- the settings are completely forgotten as soon as you quit the program. I
- probably should change the name to the "options" menu instead of the
- "settings" menu.
-
- V.C.1. Toggle between color and monochrome distance plots. This is
- self-explanatory.
-
- V.C.2. Toggle between single- and double-size distance plots. Normally,
- the distance plot of a protein R residues long is plotted on an R X R
- plane. That is, there is one pixel representing each Cartesian coordinate
- of the diagonal plot. If the "double-size" option is chosen, each
- Cartesian coordinate is represented by 4 pixels-- a 2 X 2 square of pixels.
- Choose this option to enhance viewing of small proteins, such as hemoglobin
- or cytochrome C.
-
- V.C.3. Change color scale for distance plotting. The color constant is a
- scalar used to convert an inter-C-alpha distance into a color code. The
- default value of the color constant is 2.7 and the conversion formula is
-
- color = nextLowestIntegerValueOf( distance / colorConstant )
-
- Each integer between 0 and 15 is associated with a color in the 4-bit color
- palette, as follows:
-
-     0=black          8=dark gray
-     1=blue           9=light blue
-     2=green         10=light green
-     3=cyan          11=light cyan
-     4=red           12=light red
-     5=magenta       13=light magenta
-     6=brown         14=yellow
-     7=light gray    15=white
-
- For instance, if the distance between residues X and Y is 23.5 Angstroms
- and the color constant is 2.7, then the value of distance/colorConstant is
- 8.70, and the next-lowest integer value of 8.70 is 8. Therefore, the color
- at (X,Y) on the diagonal plot will be 8=dark gray. If distance /
- colorConstant > 15, a white pixel will be displayed, representing the
- greatest distance class.
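-
- The conversion can be written out as a short Python sketch. It follows
- the formula and palette given above; the function name and the way the
- upper limit is enforced are assumptions for illustration, not a copy of
- the ABaCUS code.
-

```python
import math

# Palette order as listed above (index = color code).
PALETTE = [
    "black", "blue", "green", "cyan", "red", "magenta", "brown",
    "light gray", "dark gray", "light blue", "light green", "light cyan",
    "light red", "light magenta", "yellow", "white",
]


def color_code(distance, color_constant=2.7):
    """Convert an inter-C-alpha distance (Angstroms) to a color code 0-15."""
    code = int(math.floor(distance / color_constant))
    # Anything beyond the top of the scale falls in the greatest class.
    return min(code, 15)


if __name__ == "__main__":
    code = color_code(23.5)
    print(code, PALETTE[code])
```

-
- A larger color constant spreads the 16 colors over a wider range of
- distances; a smaller one gives finer resolution at short distances.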
-
- V.C.4. Toggle on/off file with raw data for centrality plot. A graphical
- representation of the centrality scores for all residues in a crystal
- structure is useful in attempting to understand the meaning of this type of
- scoring. The 'centrality plot' for a protein is a line graph representing
- the centrality scores vs. the amino acid residue number. ABaCUS doesn't
- actually make these plots, but it is capable of writing an output file with
- all of the data (which can then be pasted into your favorite spreadsheet or
- graphing program and used to make a centrality plot). To make the output
- file, go to the settings menu and turn on the option to write centrality
- scores to disk. Then load a crystal structure, and choose "c=centrality"
- from the distance scoring submenu and choose the appropriate scoring
- scheme, as though you were scoring a set of introns-- it doesn't matter if
- there really aren't any introns in memory. A file named "cplot.sco"
- containing the centrality scores for all residues in the protein will be
- written to disk.
-
- V.C.5. Change cryptic output from reference gene generators. This is for
- those who wish to examine details of the distribution of reference exon
- sizes or intron positions. Mainly, these options were useful when the
- reference gene generators of ABaCUS were being tested for their ability to
- produce the desired distributions.
-
- V.C.6. Change cryptic output of file with distribution of scores. Once
- this option is invoked, the complete distribution of reference scores (the
- mean score for each reference set, not the individual exon or intron
- scores) for each hypothesis that is evaluated will be appended to a file
- called "nullscor.out". Each addition to the file also contains the
- observed score and descriptive comments that allow the user to match the
- set of scores with the experiment summary written using the summary writer.
-
- V.C.7. Toggle between weighted and unweighted exon scores. Exon scores
- will be weighted inversely by the size of the exon if this option is turned
- on.
-
- V.C.8. Toggle on/off pause to allow screen dumps of diagonal plots.
- Normally, when a diagonal plot is being viewed, ABaCUS will show the plot
- forever, or until the user presses a carriage return. During this time,
- ABaCUS will absorb key combinations that might otherwise be used to access
- an automatic screen-dumping utility such as PCXDUMP. Turning on the pause
- simply puts the diagonal plot on a timer for about 20 seconds during which
- a screen dump may be made before the diagonal plot disappears and the menu
- reappears.
-
- V.C.9. Treat gene edges as element edges when converting arrays. The
- default settings for ABaCUS stipulate that the ends of a gene are treated
- as the edges of an element. That is, if an alpha-helix includes residues
- 88-100 in a 100-residue protein, then an intron at (for example) position
- 96-1 is scored as though the nearest inter-element region lies just beyond
- the end of the gene-- just beyond codon 100-- rather than just before codon
- 88. We recommend not changing the default setting. However, if the
- alternative setting is chosen, be sure to have this option turned on *when
- the array is converted* to a new maximum score, since the converter is the
- function that implements this option. After the array has been converted,
- it doesn't matter what the setting is at any later time when the array is
- viewed, saved, loaded, or used to assign scores. An array that has been
- converted with the gene-edge=element-edge option turned off may be
- converted back to its original form using the "c=convert" command with the
- option turned on. Of course, changing this option only makes a
- difference in the case of proteins that have a structural element extending
- to an edge (e.g., the last helix of hemoglobin chains often extends to the
- very last residue of the protein).
-
- V.C.10. Initialize random number generator with user-defined seed.
- Normally, the random number generator is initialized at startup with
- computer clock time (seconds elapsed since 0:00:00 Greenwich mean time, 1
- Jan 1970) and this results in a unique set of numbers for each simulation
- experiment. However, if there is a need to generate exactly the same set
- of data twice, a seed may be set manually, then re-entered for a perfect
- replicate. There would only be two reasons to do this: a) you are testing
- the reproducibility of ABaCUS's routines to make sure there are no weird
- bugs getting into them; b) you are an anally retentive type wishing to have
- complete reproducibility for the purposes of record-keeping. In either
- case, enter an unsigned 16-bit integer greater than 0, that is, a whole
- number from 1 to 65,535. If exactly the same conditions (seed, reference
- model, number of introns/exons, gene length) are used twice, then
- exactly the same set of reference genes will be generated twice.
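-
- The principle can be demonstrated with any seeded generator. The Python
- sketch below is not the ABaCUS code and will not reproduce ABaCUS's
- numbers: it uses Python's built-in generator rather than "ran2", and it
- assumes for illustration that a "uniform" reference gene is a sorted set
- of distinct intron positions drawn between the first and last site of
- the gene. The function name is hypothetical.
-

```python
import random


def uniform_intron_positions(seed, num_introns, gene_length):
    """Generate one reference set of intron positions from a given seed."""
    if not 0 < seed < 65536:
        raise ValueError("seed must be a whole number from 1 to 65,535")
    rng = random.Random(seed)
    # Distinct positions between sites 1 and gene_length - 1, in gene order.
    return sorted(rng.sample(range(1, gene_length), num_introns))


if __name__ == "__main__":
    first = uniform_intron_positions(seed=4242, num_introns=5, gene_length=900)
    second = uniform_intron_positions(seed=4242, num_introns=5, gene_length=900)
    print(first)
    print(first == second)   # same seed and conditions, same reference gene
```

-
- Changing any one of the conditions (seed, number of introns, or gene
- length) breaks the match, which is why all of them need to be recorded
- along with the seed.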
-
- V.D. HOW TO CONTACT THE PDB
-
- Access to the Brookhaven Protein Data Bank (Bernstein, et al. 1977; Abola,
- et al., 1987) is available by FTP or by Gopher (type 1, port 70, path 1/)
- to pdb.pdb.bnl.gov (130.199.144.1).
-
-
- ==========================================================================
- VI. REFERENCES
- ==========================================================================
-
- Abola, E.E., et al. 1987. Protein Data Bank, pp. 107-132 in
- _Crystallographic Databases - Information Content, Software Systems,
- Scientific Applications_ ed. F.H. Allen, G. Bergerhoff, and R. Sievers
- (Data Commission of the International Union of Crystallography, Cambridge,
- 1987).
-
- Banner, D.W., et al. 1975. Nature 255: 609.
-
- Bernstein, F.C., et al. 1977. J. Mol. Biol. 112: 535.
-
- Blake, C.C.F. 1978. Nature 273: 267.
-
- Blake, C.C.F. 1983. Nature 306: 535.
-
- Dibb, N.J. and A.J. Newman. 1989. EMBO J. 8 (7): 2015.
-
- Doolittle, W.F. 1987. Am. Nat. 130: 915.
-
- Gilbert, W., M. Marchionni, G. McKnight. 1986. Cell 46: 151. See also D.
- Straus and W. Gilbert, 1985. Mol. Cell. Biol., 5(12): 3497; and N. Lonberg
- and W. Gilbert. 1985. Cell 40: 81.
-
- Gilbert, W. and M. Glynias. 1994. Gene 135: 137.
-
- Go, M. 1981. Nature 291: 90.
-
- Go, M. 1983. Proc. Natl. Acad. Sci. U.S.A. 80: 1964.
-
- Go, M. and Nosaka. 1987. Cold Spring Harbor Symp. Quant. Biol. 52: 915.
-
- Kemmerer, E.C., M. Lei and R. Wu. 1991a. J. Mol. Evol. 32: 227.
-
- Kemmerer, E.C., M. Lei and R. Wu. 1991b. Mol. Biol. Evol. 8(2): 212.
-
- Press, W.H. et al. 1992. _Numerical Recipes in C_ (Cambridge Univ. Press,
- London, 1992, 2nd ed.).
-
- Raitt, D.C., R.E. Bradshaw and T.M. Pillar. 1994. Mol. Gen. Gen. 242: 17.
-
- Stoltzfus, A., et al. 1994. Testing the Exon Theory of Genes: The
- Evidence from Protein Structure. Science XXX: XXX.
-