Bioinformatics in the Classroom

Identifying Genes in DNA Sequences

Identifying Genes in Nucleotide Sequences

Concepts:
  • The goal of gene prediction is to correctly identify and annotate as many genes as possible in raw sequences.


  • Many genes are conserved among organisms and can be identified through homology searches (e.g. through blastx, tblastx).


  • Messenger RNA fragments can be utilized to predict genes by locating their sequences within the genome.


  • Ab initio gene prediction attempts to identify genes without any knowledge of their function or of the genetics of the organism. This is possible because different gene features, such as exons, introns, and promoters, are associated with different patterns in the DNA sequence. These patterns can be identified with the help of computer programs. Because this approach relies on specific patterns in the DNA sequence rather than on sequence homology, it can identify novel genes.
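As a minimal sketch of what "identifying patterns with the help of computer programs" can look like, the Python snippet below scans a sequence for two short consensus motifs. The motif definitions (TATA[AT]A[AT] for a TATA box, AATAAA for a polyA signal), the function name scan_patterns, and the example sequence are assumptions chosen for illustration. Real gene finders combine many such signals statistically rather than relying on exact matches.

```python
import re

# Illustrative consensus patterns (assumptions for this sketch, not a full model).
# The lookahead (?=...) lets overlapping occurrences be reported as well.
PATTERNS = {
    "TATA box": re.compile(r"(?=(TATA[AT]A[AT]))"),
    "polyA signal": re.compile(r"(?=(AATAAA))"),
}

def scan_patterns(sequence):
    """Return (feature name, 0-based start, matched text) for every hit."""
    seq = sequence.upper()
    hits = []
    for name, pattern in PATTERNS.items():
        for match in pattern.finditer(seq):
            hits.append((name, match.start(), match.group(1)))
    # Report hits in the order they occur along the sequence.
    return sorted(hits, key=lambda hit: hit[1])

# Made-up example sequence, for illustration only.
example = "GGCTATAAAAGGCATGGCCTGAAATAAAGC"
for name, start, text in scan_patterns(example):
    print(name, start, text)
```

A hit found this way is only a candidate: short motifs occur frequently by chance, which is why several independent features are usually considered together.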


Gene prediction aims to assign a structure like the following to a raw DNA sequence:

  • Upstream: intergenic region
  • Promoter: e.g. TATA
  • First exon: transcriptional start, 5'-UTR, translational start, CDS/ORF and enhancer sites
  • Intron(s): frequent stop codons
  • Exon(s): CDS/ORF and enhancer sites
  • Intron(s): frequent stop codons
  • Last exon: CDS/ORF and enhancer sites, translational stop, 3'-UTR, polyA-insertion site, transcriptional stop
  • Downstream: intergenic region
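One way to hold such a structure in a program is as an ordered list of annotated intervals. The sketch below is only an illustration: the Feature class, the is_contiguous helper, and all coordinates are hypothetical and do not describe a real gene.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    kind: str    # e.g. "promoter", "first exon", "intron"
    start: int   # 0-based, inclusive
    end: int     # 0-based, exclusive

def is_contiguous(features):
    """True if each feature begins exactly where the previous one ends."""
    return all(a.end == b.start for a, b in zip(features, features[1:]))

# Hypothetical coordinates, for illustration only.
gene_model = [
    Feature("upstream intergenic region", 0, 200),
    Feature("promoter", 200, 300),
    Feature("first exon", 300, 450),
    Feature("intron", 450, 900),
    Feature("exon", 900, 1020),
    Feature("intron", 1020, 1500),
    Feature("last exon", 1500, 1800),
    Feature("downstream intergenic region", 1800, 2000),
]

# Total length of all exons in the model.
exon_length = sum(f.end - f.start for f in gene_model if "exon" in f.kind)
print(is_contiguous(gene_model), exon_length)
```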


Gene features and DNA characteristics utilized in ab initio gene prediction include:

Feature | Characteristic
Base composition | Usage of di-, tri-, and hexanucleotides
Coding sequences (CDS) | Open reading frames (ORFs); GC-rich; CpG-content
Translational start and stop sites | Start (ATG) and stop (TAA, TAG, TGA) codons
Splice sites (exon/intron borders) | Consensus sequences (characteristic nucleotide combinations at and around the borders between exons and introns)
Promoter regions | TATA, Shine-Dalgarno, Pribnow, Kozak consensus; CpG-content
PolyA-signals | Consensus sequences (characteristic nucleotide combinations about 10-20 nucleotides upstream of the insertion site for the polyA-tail)

The patterns associated with these characteristics were uncovered through the analysis of genes and genetic regions before sequenced genomes became available.
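Two of the characteristics listed above, open reading frames delimited by start and stop codons and GC content, are simple enough to sketch directly. The Python code below is a simplified illustration under stated assumptions: it searches only the three forward reading frames, ignores introns, and uses hypothetical names (find_orfs, gc_content, min_codons) and a made-up example sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(sequence, min_codons=3):
    """Find ORFs (ATG ... stop codon) in the three forward reading frames.

    Returns (frame, start, end) tuples with a 0-based start and an
    exclusive end; the stop codon is included in the reported span.
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first start codon opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (pos + 3 - start) // 3 >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None  # look for the next ORF in this frame
    return orfs

def gc_content(sequence):
    """Fraction of G and C bases in the sequence (0.0 for an empty one)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

# Made-up example sequence, for illustration only.
example = "CCATGGCCGGGTAAGG"
for frame, start, end in find_orfs(example):
    print(frame, start, end, example[start:end], round(gc_content(example[start:end]), 2))
```

In eukaryotic genomes a coding sequence is usually split across several exons, so a plain ORF scan like this one is at best a first filter and would need splice-site information to reconstruct a full CDS.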

Identifying Genes in Nucleotide Sequences - Exercises