Bioinformatics in the Classroom

Finding and Retrieving DNA Sequences in Databases

Finding and Retrieving Nucleotide and Protein Sequences - Introduction

DNA and protein sequence databases are huge, and it can be a challenge to find answers to specific questions. Some examples of queries are:
  • finding sequences which are similar to a template sequence;
  • finding sequences for specific genes whose name is known;
  • finding sequences for genes whose function is known;
  • finding functions for genes whose sequence is known;
  • finding protein sequences for known nucleotide sequences;
  • finding background information for genes, proteins, or phenotypes;
  • finding chromosomal locations for genes;
  • finding homologous genes in different organisms;
  • identifying specific characteristics in DNA sequences (e.g. exons, introns, promoter regions, etc.).
For databases to be useful, the data must be readily accessible to users. GenBank is the sequence repository of the National Center for Biotechnology Information (NCBI), which is part of the US National Library of Medicine. This repository contained 11 billion bases as of March 2001 and is growing so fast that biocomputing sites that keep a local copy of GenBank need to update it nightly (versus a few times per year in 1999).

Researchers from all over the world access GenBank, predominantly through the Internet. Each sequence in GenBank has a unique accession number, and GenBank entries can be queried with a variety of tools (Entrez, BLAST, FASTA, ...). Entrez is a web-based search engine for GenBank DNA and protein sequences as well as MEDLINE reference publications. It can also search for keywords such as gene names, protein names, and the names of organisms or biological functions. Entrez is also a database in its own right: it contains all of the nucleotide and protein sequences in GenBank (updated daily) along with all of the literature in MEDLINE and the 3-D protein structures in PDB (Protein Data Bank). It also contains a pre-computed list of relationships between all of its data elements:

  • Relationships between sequences are computed with BLAST.
  • Relationships between articles are computed from shared MeSH (Medical Subject Headings) terms, i.e. shared keywords.
  • Relationships between DNA and protein sequences rely on accession numbers.
  • Relationships between sequences and MEDLINE articles rely on both shared keywords and the mention of accession numbers in the articles.
These pre-computed relationships might include genes in the same multi-gene family, articles written about genes that have the same function, or other proteins that function in the same biochemical pathway, thereby allowing horizontal movement through the linked databases. Thus, a researcher can start with only a vague set of keywords or a sequence identified in the laboratory and rapidly access a set of relevant literature and a list of related database sequences.
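
Entrez can also be reached from a program rather than a web browser. The sketch below only builds the request URLs for a keyword search and for retrieval by accession number; it does not contact NCBI. It assumes the NCBI E-utilities interface (the esearch and efetch services and their db, term, id, retmax, rettype, and retmode parameters), which is not described in this tutorial. The function names and the accession number U01317 are placeholders chosen for illustration.

```python
from urllib.parse import urlencode

# Assumed base address of the NCBI E-utilities (programmatic access to Entrez).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(database: str, term: str, max_results: int = 20) -> str:
    """Build a URL that searches one Entrez database for a keyword term."""
    params = {"db": database, "term": term, "retmax": max_results}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_fasta_url(database: str, accession: str) -> str:
    """Build a URL that retrieves one record in FASTA format by accession number."""
    params = {"db": database, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Hypothetical usage: a vague keyword search, then a direct fetch by accession.
search = esearch_url("nucleotide", "hemoglobin beta")
fetch = efetch_fasta_url("nucleotide", "U01317")
```

Pasting either URL into a browser, or opening it with urllib.request.urlopen, should return the search results or the sequence, provided the services behave as assumed here.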

Database Search Strategies

  • General search principles (not limited to sequences, or to biology):
      • Use accession numbers whenever possible.
      • Start with broad keywords and narrow the search using more specific terms.
      • Try variants of spelling, numbers, etc.
      • Search all relevant databases.
      • Save your results locally on your machine or mainframe!
  • Refine the query:
      • Often a search finds too many (or too few) sequences, so you can go back and try again with more (or fewer) keywords in your query.
      • The 'History' feature allows you to combine any of your past queries.
      • The 'Limits' feature allows you to limit a query to specific organisms, sequences submitted during a specific period of time, etc.
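
The broad-then-narrow strategy and the effect of 'Limits' can be mimicked by composing the query text itself. The sketch below is an illustration only: build_query is a hypothetical helper, and the field tags [Organism] and [Publication Date], the AND operator, and the start:end date range syntax are assumptions about Entrez query syntax that should be checked against the Entrez help pages.

```python
def build_query(keywords, organism=None, date_range=None):
    """Combine keywords with optional organism and date limits into one query string."""
    # Quote multi-word keywords so they are treated as phrases.
    parts = [f'"{k}"' if " " in k else k for k in keywords]
    if organism:
        # Assumed field tag restricting hits to one organism.
        parts.append(f'"{organism}"[Organism]')
    if date_range:
        # Assumed range syntax: start:end followed by a date field tag.
        start, end = date_range
        parts.append(f"{start}:{end}[Publication Date]")
    return " AND ".join(parts)


# Start broad, then narrow with an extra keyword and two limits.
broad = build_query(["kinase"])
narrow = build_query(
    ["kinase", "tyrosine"],
    organism="Homo sapiens",
    date_range=("2000", "2001"),
)
print(broad)
print(narrow)
```

If the narrowed query returns too few sequences, drop a keyword or a limit and try again, as described in the list above.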

 

Finding and Retrieving Sequences - Exercises