Bioinformatics in the Classroom

Identifying Genes in DNA Sequences

Gene Prediction - Synthesis

Concepts:
  • A few sequence features on their own are usually unreliable for locating genes in genomic sequences. Most sequence annotation programs therefore combine a variety of strategies to annotate genomic sequences successfully and comprehensively.


  • Sensitivity of gene prediction programs: How many exons were correctly identified?
    Sensitivity = True Positives divided by the sum of True Positives and False Negatives
    Sens. = TP / (TP + FN)


  • Specificity of gene prediction programs: How many predicted exons are indeed exons?
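    Specificity = True Positives divided by the sum of True Positives and False Positives (outside gene prediction, this quantity is usually called precision or positive predictive value)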
    Spec. = TP / (TP + FP)
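
The two formulas above can be illustrated with a minimal Python sketch. It assumes that a predicted exon counts as a true positive only when both its start and end coordinates match an annotated exon exactly; real assessments may score partial overlaps differently. The function name exon_accuracy and all coordinates are hypothetical and chosen only for illustration.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exon level.

    Assumption: an exon is a true positive only if its (start, end)
    pair matches an annotated exon exactly.
    """
    predicted = set(predicted)
    annotated = set(annotated)
    tp = len(predicted & annotated)   # predicted and annotated
    fp = len(predicted - annotated)   # predicted but not annotated
    fn = len(annotated - predicted)   # annotated but missed
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tp / (tp + fp) if (tp + fp) else 0.0
    return sens, spec


# Hypothetical (start, end) exon coordinates, for illustration only.
annotated = [(100, 250), (400, 520), (700, 810), (950, 1100), (1300, 1420)]
predicted = [(100, 250), (400, 520), (600, 650), (700, 815)]

sens, spec = exon_accuracy(predicted, annotated)
print(f"Sens. = {sens:.2f}, Spec. = {spec:.2f}")  # Sens. = 0.40, Spec. = 0.50
```

In this made-up example, two of the five annotated exons are found (Sens. = 2/5) and two of the four predictions are correct (Spec. = 2/4). The prediction (700, 815) misses the annotated exon (700, 810) by five bases, so under the exact-match assumption it counts as both a false positive and a false negative.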

The following is a sample of articles that analyze the success rate of gene prediction and genome annotation tools:

  • Burge and Karlin (1997). Prediction of Complete Gene Structures in Human Genomic DNA (see Table 1). Journal of Molecular Biology 268:78-94
  • Reese et al. (2000). Genome Annotation Assessment in Drosophila melanogaster. Genome Research 10:483-501
  • Guigo et al. (2000). An Assessment of Gene Prediction Accuracy in Large DNA Sequences. Genome Research 10:1631-1642
  • Rogic et al. (2001). Evaluation of Gene-Finding Programs in Mammalian Sequences. Genome Research 11:817-832

  • In general, automatic exon and gene prediction is unreliable, with a large number of false positives and missed genes: the sensitivity of automatic gene prediction is between 20% and 70%, and the specificity between 5% and 57%.

    • About 50% of exon predictions are false positives
    • Short exons are poorly predicted
    • Different programs often give conflicting predictions
    • Accuracy is species-specific

    Use of supporting evidence (ESTs, cDNAs, protein homology) dramatically increases annotation success!
