Identifying Genes in Nucleotide Sequences
Concepts:
- The goal of gene prediction is to correctly identify and annotate as many genes as possible in raw sequences.
- Many genes are conserved among organisms and can be identified through homology searches (e.g. through blastx, tblastx).
- Messenger RNA fragments can be utilized to predict genes by locating their sequences within the genome.
- Ab initio gene prediction attempts to identify genes without any knowledge of their
function or of the genetics of the organism. This is possible because different
gene features, such as exons, introns, and promoters, are associated with different
patterns in the DNA sequence. These patterns can be identified with the help of computer programs.
Because it relies on specific patterns in the DNA sequence rather than on sequence homology,
this approach can also identify novel genes.
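As a rough illustration of pattern-based (ab initio) searching, the sketch below scans the three forward reading frames of a sequence for open reading frames that begin with ATG and end at the first in-frame stop codon. The function name `find_orfs` and the `min_codons` parameter are hypothetical names chosen for this example. Real gene finders use far richer statistical models and must also handle introns and the reverse strand, so this is only a minimal sketch.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) tuples for ATG...stop ORFs on the forward strand.

    Coordinates are 0-based and end-exclusive; the ORF length in codons
    includes both the ATG and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the ATG that opened the current ORF, if any
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs

# Toy example: one ORF (ATG AAA TAG) in the third reading frame.
example_orfs = find_orfs("CCATGAAATAGCC")
```

Raising `min_codons` filters out the many short ORFs that occur by chance in any sequence, which is why length thresholds are a common first filter.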
Gene prediction aims to assign a structure like this to a raw
DNA sequence:
| Region | Typical content |
| Upstream | Intergenic Region |
| Promoter | e.g. TATA |
| First Exon | Transcriptional Start, 5'-UTR, Translational Start, CDS/ORF & Enhancer Sites |
| Intron(s) | Frequent Stop Codons |
| Exon(s) | CDS/ORF & Enhancer Sites |
| Intron(s) | Frequent Stop Codons |
| Last Exon | CDS/ORF & Enhancer Sites, Translational Stop, 3'-UTR, PolyA-insertion Site, Transcriptional Stop |
| Downstream | Intergenic Region |
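One way to picture the result of gene prediction is as a list of labeled intervals laid over the raw sequence. The sketch below is a minimal, hypothetical representation: the `Feature` class, the helper functions, and the toy sequence are all invented for illustration and do not correspond to any particular annotation format.

```python
from dataclasses import dataclass

@dataclass
class Feature:
    kind: str   # e.g. "upstream", "promoter", "exon", "intron", "downstream"
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive

def is_contiguous(features):
    """True if the features tile the region with no gaps or overlaps."""
    ordered = sorted(features, key=lambda f: f.start)
    return all(a.end == b.start for a, b in zip(ordered, ordered[1:]))

def spliced_exons(seq, features):
    """Concatenate the exon sequences in order, leaving out the introns."""
    exons = sorted((f for f in features if f.kind == "exon"),
                   key=lambda f: f.start)
    return "".join(seq[f.start:f.end] for f in exons)

# Toy annotation (made up): build the sequence and its features together
# so that the coordinates are easy to verify.
parts = [
    ("upstream", "AAAA"),
    ("promoter", "TATA"),
    ("exon", "ATGGC"),
    ("intron", "GTCCAG"),
    ("exon", "TTAA"),
    ("downstream", "CCCC"),
]
toy_seq = "".join(segment for _, segment in parts)
toy_features = []
pos = 0
for kind, segment in parts:
    toy_features.append(Feature(kind, pos, pos + len(segment)))
    pos += len(segment)

toy_spliced = spliced_exons(toy_seq, toy_features)  # exons joined, intron removed
```

Storing explicit start and end coordinates keeps the annotation independent of the sequence itself, so the same sequence can carry several competing predictions.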
Gene features and DNA characteristics utilized in ab initio gene prediction include:
| Feature | Characteristic |
| Base composition | Usage of di-, tri-, hexanucleotides |
| Coding sequences (CDS) | Open reading frames (ORFs); GC-rich; CpG-content |
| Translational start and stop sites | Start (ATG) and stop (TAA, TAG, TGA) codons |
| Splice sites (Exon/Intron borders) | Consensus sequences (characteristic nucleotide combinations at and around the borders between exons and introns) |
| Promoter regions | TATA, Shine-Dalgarno, Pribnow, Kozak Consensus; CpG-content |
| PolyA-signals | Consensus sequences (characteristic nucleotide combinations at about 10-20 nucleotides upstream of the insertion site for the polyA-tail) |
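Several of the characteristics in the table are simple counting statistics. The sketch below shows one plausible way to compute k-mer usage, GC content, and an observed/expected CpG ratio for a sequence. The function names are hypothetical, and the CpG formula used here (CpG count times sequence length, divided by the product of the C and G counts) is one commonly used definition, assumed for this example rather than taken from the text.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count overlapping k-mers (k=2 for dinucleotides, k=6 for hexanucleotides)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def gc_content(seq):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G))."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0  # ratio is undefined without both bases; report 0.0 here
    return seq.count("CG") * len(seq) / (c * g)

# Toy usage on short made-up sequences.
dinucleotides = kmer_counts("ATATAT", 2)
gc = gc_content("ATGC")
ratio = cpg_obs_exp("CGCG")
```

In practice these statistics are computed in sliding windows along the sequence and compared between known coding and non-coding regions, rather than over a whole sequence at once.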
The patterns associated with these characteristics were uncovered through the analysis of genes and
genetic regions before sequenced genomes became available.
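Once a consensus pattern is known, locating candidate occurrences in a new sequence is essentially a string search. The sketch below scans a sequence for a few example motifs using regular expressions. The specific patterns for the TATA box (`TATAAA`) and the polyA signal (`AATAAA`) are commonly cited consensus forms assumed here for illustration, and the `MOTIFS` table and `scan_motifs` function are hypothetical names. A match is only a candidate site, since short motifs occur frequently by chance.

```python
import re

# Illustrative patterns only; real consensus sequences are degenerate
# and are usually modeled with position weight matrices.
MOTIFS = {
    "start_codon": "ATG",
    "stop_codon": "TAA|TAG|TGA",
    "polyA_signal": "AATAAA",  # assumed consensus, not given in the text
    "TATA_box": "TATAAA",      # assumed consensus, not given in the text
}

def scan_motifs(seq, motifs=MOTIFS):
    """Return sorted (position, name, matched_text) for every motif occurrence."""
    seq = seq.upper()
    hits = []
    for name, pattern in motifs.items():
        # A lookahead reports overlapping occurrences as well.
        for match in re.finditer(f"(?=({pattern}))", seq):
            hits.append((match.start(), name, match.group(1)))
    return sorted(hits)

# Toy usage on a made-up sequence.
toy_hits = scan_motifs("GGTATAAAGGATGCCCTAAGGAATAAAGG")
```

Combining such hits with positional constraints, for example a polyA signal downstream of a stop codon, is what turns isolated motif matches into evidence for a gene structure.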