Bioinformatics in the Classroom

Genes in DNA Sequences - Exercise 1

    Exercise 1 - Ab Initio Gene Prediction

      Identify the gene in this human nucleotide sequence. Record your results on this worksheet, and also in a word processor file and/or in the form of screen shots.

      To predict genes in a DNA sequence, follow these steps:


      1. Narrow down the gene region and direction by determining the strand that encodes the gene ('Watson' strand vs. 'Crick' strand; the 'Crick' strand is the reverse complement of the 'Watson' strand), as well as the start and the end of the gene.
      2. Predict exon/intron borders.
      3. Identify ORFs that could harbor the coding sequences of exons.
      4. Identify start and stop codons. Adjust exon/intron borders to yield one contiguous CDS across the length of the gene.

      Here is how to do it:


      1. Determine the start and the end of the prospective gene:

        • To determine the beginning of the gene, run the sequence through a Transcriptional Start Site Finder and note the possible start regions. Save your results in your notepad file or as screen shots. (Should you experience problems with the program, view the results here.)

        • To determine the end of the gene, run the sequence through this Poly A Signal Predictor. Save your results in your notepad file or as screen shots. (Should you experience problems with the program, view the results here.)

        • Mark the promoter region, the prospective transcriptional start site, and the polyA-signal sites in your worksheet. In which direction is the gene most likely transcribed?

        • For the sake of time, you have not made any predictions for the reverse (Crick) strand. If you wish to do so, build the reverse complement of the sequence here and run the promoter and polyA-signal prediction programs again.
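
        To make the strand comparison concrete, here is a minimal Python sketch that builds the reverse complement and scans both strands for the canonical polyadenylation hexamer AATAAA. It is an illustration only: the linked prediction programs use more elaborate models, and the function names and the toy sequence are assumptions made for this example, not part of the exercise data.

```python
# Illustrative sketch only; not the algorithm used by the linked web tools.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement ('Crick' strand) of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_polya_signals(seq: str, motif: str = "AATAAA") -> list[int]:
    """Return the 1-based start position of every occurrence of the motif."""
    seq = seq.upper()
    hits = []
    pos = seq.find(motif)
    while pos != -1:
        hits.append(pos + 1)            # convert 0-based index to 1-based
        pos = seq.find(motif, pos + 1)  # allow overlapping matches
    return hits


# Hypothetical toy sequence, NOT the exercise sequence.
watson = "GGCTATAAAAGGCATGCCAATAAAGTTC"
crick = reverse_complement(watson)

print("Watson strand hits:", find_polya_signals(watson))
print("Crick strand hits: ", find_polya_signals(crick))
```

        Positions reported for the Crick strand are counted from the start of the reverse-complemented sequence, so they have to be converted before they are marked on a worksheet that uses Watson-strand coordinates.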


      2. Discern exons and introns by identifying splice sites (i.e. the borders between exons and introns):


        • Run the sequence through this splice site prediction program at UC Berkeley, changing the donor score cutoff to 0.88 and the acceptor score cutoff to 0.94 (the donor site lies at the beginning of an intron, the acceptor site at its end). (Should you experience problems with the program, view the results here.)

        • What does the output page tell you about the prediction mechanism?

        • Use the output to determine the splice site positions and record the results in the worksheet.

        • From the six predicted splice sites, build several alternative maps that show how exons and introns could follow one another in this gene.
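
        Building the alternative maps by hand amounts to pairing each donor with a downstream acceptor and chaining the resulting introns so that they do not overlap. The sketch below enumerates those combinations. The function names and the splice site positions are hypothetical; substitute the positions you recorded from the prediction output.

```python
from typing import Iterator

Intron = tuple[int, int]  # (donor position, acceptor position)


def intron_chains(donors, acceptors, min_intron=1) -> Iterator[list[Intron]]:
    """Enumerate every ordered set of non-overlapping introns.

    An intron is a (donor, acceptor) pair with donor < acceptor. Each further
    intron in a chain must begin after the previous intron has ended.
    """
    donors, acceptors = sorted(donors), sorted(acceptors)

    def extend(chain, after):
        for d in donors:
            if d <= after:
                continue
            for a in acceptors:
                if a - d >= min_intron:
                    new_chain = chain + [(d, a)]
                    yield new_chain
                    yield from extend(new_chain, a)

    yield from extend([], 0)


def exon_map(chain, seq_len):
    """Turn a chain of introns into exon (start, end) pairs, 1-based inclusive.

    Assumes the donor position is the first intron base and the acceptor
    position is the last intron base. The first and last segments simply run
    to the ends of the sequence, not to the true gene boundaries.
    """
    exons, start = [], 1
    for donor, acceptor in chain:
        exons.append((start, donor - 1))
        start = acceptor + 1
    exons.append((start, seq_len))
    return exons


# Hypothetical positions, NOT the predictions for the exercise sequence.
for chain in intron_chains(donors=[100, 400], acceptors=[250, 600]):
    print("introns:", chain, "-> exons:", exon_map(chain, seq_len=800))
```

        The enumeration only says which maps are geometrically possible. Steps 3 and 4 are what narrow the list down to maps that are consistent with the reading frames.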


      3. Determine how the predicted splice sites align with open reading frames:

        Relate the predicted splice sites to ORFs: donor sites (where exons end) should be preceded by ORFs, and acceptor sites (where exons begin) should be followed by ORFs.
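
        One way to picture this check is to list, for each of the three forward reading frames, the stretches that contain no stop codon, and then see which stretches end at a donor site or begin at an acceptor site. The following is a minimal sketch under that interpretation of "ORF"; the function name and the toy sequence are assumptions for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}


def stop_free_stretches(seq, min_codons=1):
    """Yield (frame, start, end) for each run of codons without a stop codon.

    Frame is 1, 2 or 3 (i.e. +1, +2, +3); start and end are 1-based inclusive
    nucleotide positions. The stop codon itself is not part of the stretch.
    """
    seq = seq.upper()
    for offset in range(3):
        run_start = offset
        for i in range(offset, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if (i - run_start) // 3 >= min_codons:
                    yield (offset + 1, run_start + 1, i)
                run_start = i + 3
        # Close the final run at the last complete codon of this frame.
        end = offset + ((len(seq) - offset) // 3) * 3
        if (end - run_start) // 3 >= min_codons:
            yield (offset + 1, run_start + 1, end)


# Hypothetical toy sequence, NOT the exercise sequence.
for frame, start, end in stop_free_stretches("ATGAAATAACCCGGG"):
    print(f"frame +{frame}: {start}-{end}")
```

        Raising `min_codons` filters out the short stretches that occur by chance in any sequence.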


      4. Determine which of the predicted splice sites truly border exons and introns:

        Relate the predicted splice sites to start and stop codons: donor sites (where exons end) should be preceded by ORFs, and acceptor sites (where exons begin) should be followed by ORFs.

        • For each splice site, use this table to examine exactly where it is preceded (donor sites) or followed (acceptor sites) by a stop codon. The data in this table were derived by applying the Translation tool in Sequence Utilities with 'Show only start and stop codons' checked. Make sure you examine all three reading frames, +1, +2, and +3, for each splice site. (Hint: start by identifying the ORF in the internal exon. Then determine the CDS in the last exon and the translational stop site. Finally, work out the first exon and the translational start site.)
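
        A table of this kind can be approximated with a few lines of Python. The sketch below lists the start (ATG) and stop codons in each forward frame and reports, for a given splice site position, the nearest stop codon upstream in every frame. It is not the Translation tool itself; the function names, the toy sequence, and the example position are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}


def codon_positions(seq):
    """Map each frame (1, 2, 3) to the 1-based positions of its ATG and stop codons."""
    seq = seq.upper()
    table = {}
    for offset in range(3):
        starts, stops = [], []
        for i in range(offset, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if codon == "ATG":
                starts.append(i + 1)
            elif codon in STOPS:
                stops.append(i + 1)
        table[offset + 1] = {"start": starts, "stop": stops}
    return table


def nearest_stop_upstream(table, site):
    """For each frame, return the last stop codon that begins before `site`, or None."""
    return {
        frame: max((p for p in cols["stop"] if p < site), default=None)
        for frame, cols in table.items()
    }


# Hypothetical toy sequence and site position, NOT the exercise data.
table = codon_positions("ATGAAATAACCCGGG")
print(table)
print(nearest_stop_upstream(table, site=12))
```

        A frame with no stop codon between the preceding acceptor site and a donor site remains a candidate coding frame for that exon.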


      5. Characterize the gene answering the following questions:


        • What characteristics does this gene have (length, exons, introns, splice sites, promoter, etc.)?
        • What do the mRNA and amino acid sequences for this gene look like?
        • Is it possible that the gene in this sequence consists of only one CDS and contains no introns?
        • Identify PCR primers that could be used as markers for your gene. (A link to primer design software is given above.)
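
        Once the exon coordinates are settled, the coding sequence and its translation can be derived mechanically. The sketch below joins the exon segments and translates them with the standard genetic code. The exon coordinates and the toy genomic sequence are hypothetical, and the result covers only the CDS, not the untranslated regions of the full mRNA.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def splice(seq, exons):
    """Join exon segments given as 1-based inclusive (start, end) pairs."""
    return "".join(seq[start - 1:end] for start, end in exons)


def translate(cds):
    """Translate a CDS up to, but not including, the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3].upper(), "X")  # X = unknown codon
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical toy gene: exon 1, an intron (GT...AG), exon 2.
genomic = "ATGGCC" + "GTAAGTTTTCAG" + "AAATGA"
cds = splice(genomic, exons=[(1, 6), (19, 24)])

print("CDS:    ", cds)
print("as RNA: ", cds.replace("T", "U"))
print("protein:", translate(cds))
```

        If the joined exons do not begin with ATG, end with a stop codon, and have a length divisible by three, the exon borders probably need to be adjusted as described in step 4.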


      6. Documentation and more:

        Find links to in-depth documentation for this exercise (as well as an additional gene-identification case for 2,000 nucleotides) in the worksheet index. Both sets of documentation go well beyond the steps listed above and in many cases provide the results from a number of different prediction programs (screen shots for promoter and splice site prediction programs, etc.). In the case of the 1,500 nucleotide sequence, the documentation also starts out slightly differently, in that it provides three clones that first need to be assembled into the sequence of 1,500 nucleotides. Go to Worksheets.


Go on to Exercise 2 - Gene Prediction Software

Go on to Gene Prediction - Synthesis
