Bioinformatics in the Classroom

SNPs
Variations on the Human Genome - Exercises

SNP Exploration

The sequencing of the human genome has led to the discovery of a bounty of SNPs; on average, two equivalent human chromosomes differ by one SNP per 1,300 nucleotides. Several research institutions are involved in SNP discovery, the most prominent of which is The SNP Consortium. The SNP Consortium consists of several pharmaceutical and informatics corporations and four major academic centers for molecular genetics and genomics (see http://snp.cshl.org/about/).

The following exercises will familiarize you with SNPs, SNP phylogeny, and the SNP database of The SNP Consortium.

  • Determine the differences between the following six sequences utilizing Sequence Server:

    1. agctggctgaatgctatctgcgtcgcgcgaaataaacgtcagcattcgttacatctctctagggc
    2. agctggctgaatgctatctgcctcgcgcgaaataaacgtcagcattcgttacatttctctagggc
    3. agctggctgaatgctatctgcgtcgcgcgaaacaaacgtcagcattcgttacatttctctagggc
    4. agctggctgaatgctatctgcgtcgcgcgaaataaacgtcagcattcgttacatttctctagggc
    5. agctggctgagtgctatctgcctcgcgcgaaataaacgtcagcattcgttacatttctctagggc
    6. agctggctgaatgctatctgcgtcgcgcgaaataaacgtcagcgttcgttacatctctctagggc
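If Sequence Server is not at hand, the comparison can also be sketched in a few lines of Python. This is only an illustration, not the tool the exercise refers to: the function name `variable_positions` is made up here, and the sketch assumes the six sequences are already aligned and of equal length (substitutions only, no insertions or deletions).

```python
# The six sequences from the exercise, keyed by their list number.
sequences = {
    1: "agctggctgaatgctatctgcgtcgcgcgaaataaacgtcagcattcgttacatctctctagggc",
    2: "agctggctgaatgctatctgcctcgcgcgaaataaacgtcagcattcgttacatttctctagggc",
    3: "agctggctgaatgctatctgcgtcgcgcgaaacaaacgtcagcattcgttacatttctctagggc",
    4: "agctggctgaatgctatctgcgtcgcgcgaaataaacgtcagcattcgttacatttctctagggc",
    5: "agctggctgagtgctatctgcctcgcgcgaaataaacgtcagcattcgttacatttctctagggc",
    6: "agctggctgaatgctatctgcgtcgcgcgaaataaacgtcagcgttcgttacatctctctagggc",
}


def variable_positions(seqs):
    """Return {position: {sequence number: base}} for each column that varies.

    Positions are 1-based. Assumes equal-length, pre-aligned sequences.
    """
    if len({len(s) for s in seqs.values()}) != 1:
        raise ValueError("sequences must all have the same length")
    ids = sorted(seqs)
    result = {}
    # zip(*...) walks through the alignment one column at a time.
    for pos, column in enumerate(zip(*(seqs[i] for i in ids)), start=1):
        if len(set(column)) > 1:
            result[pos] = dict(zip(ids, column))
    return result


if __name__ == "__main__":
    for pos, bases in variable_positions(sequences).items():
        print(pos, " ".join(f"{i}:{b}" for i, b in bases.items()))
```

Each printed line gives one variable position followed by the base that every sequence carries there, which is the raw material for the haplotype exercise that follows.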

  • Haplotypes are groups of SNPs that occur linked together in an allele. Determine how the following six haplotypes could be derived from each other. Draw a phylogenetic tree, assuming that at each branching point only one nucleotide is altered. Thus, two lineages proceed from each branching point: one carrying the parental haplotype and one carrying a haplotype that differs from it at a single nucleotide position.

    1. --A--G--T--A--C--
    2. --A--C--T--A--T--
    3. --A--G--C--A--T--
    4. --A--G--T--A--T--
    5. --G--C--T--A--T--
    6. --A--G--T--G--C--
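A possible starting point for the tree is to list which haplotypes differ from each other at exactly one position, since only those can be joined directly by a single branching event. The sketch below does this by counting mismatches between every pair; the helper names `hamming` and `single_step_pairs` are illustrative. It does not decide which haplotype is ancestral, so the direction of each step still has to be reasoned out by hand.

```python
from itertools import combinations

# The six haplotypes from the exercise, with the "--" spacers removed.
haplotypes = {
    1: "AGTAC",
    2: "ACTAT",
    3: "AGCAT",
    4: "AGTAT",
    5: "GCTAT",
    6: "AGTGC",
}


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("haplotypes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def single_step_pairs(haps):
    """Return all pairs of haplotype numbers that differ at exactly one site."""
    return [
        (i, j)
        for i, j in combinations(sorted(haps), 2)
        if hamming(haps[i], haps[j]) == 1
    ]


if __name__ == "__main__":
    for i, j in single_step_pairs(haplotypes):
        print(f"haplotype {i} <-> haplotype {j}")
```

If the number of single-step pairs is one less than the number of haplotypes and all haplotypes are connected, the pairs already form an unrooted tree.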


  • The SNP Consortium Database at Cold Spring Harbor Laboratory contains roughly 1.5 million entries. This file is too big to download, so please don't try it on a course computer!
  • However, you can view the content of the CSHL SNP database here.

  • In order to develop an understanding of the nature of the SNP database it is sufficient to look into small excerpts of the entire database. Below find a number of smaller packages derived from the entire database.


  • Select a package and analyze its content.
    1. How many SNPs does your database excerpt list? How many does the entire database list? (Hint: look at the last package in the list to answer this question)
    2. What entries (columns) are provided for each SNP?
    3. Select a few SNPs and list their genotypes. (Hint: use the GenBank accession in the database to find out which chromosome the SNPs are located on.)
    4. List the different polymorphisms in your package. Which combinations did you find?
    5. In what percentages do the different polymorphisms occur in your package?
    6. List reasons that may lead to the great variation in frequencies. Hint: think about how the different alleles that have been found for a single locus may have been derived from each other. (Hint: examine the structures of the four different nucleotides in this listing, this website or this image)
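One property worth checking when thinking about the nucleotide structures is whether a polymorphism exchanges two bases of the same chemical class. A substitution between two purines (A, G) or between two pyrimidines (C, T) is called a transition; a substitution between a purine and a pyrimidine is a transversion. The sketch below sorts an allele pair into one of the two groups; the function name `classify` is illustrative.

```python
PURINES = {"A", "G"}       # two-ring bases
PYRIMIDINES = {"C", "T"}   # one-ring bases


def classify(allele_a, allele_b):
    """Classify a two-allele SNP as a 'transition' or a 'transversion'."""
    a, b = allele_a.upper(), allele_b.upper()
    valid = PURINES | PYRIMIDINES
    if a == b or a not in valid or b not in valid:
        raise ValueError(f"not a two-allele SNP: {allele_a}/{allele_b}")
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


if __name__ == "__main__":
    for pair in ["A/G", "C/T", "A/C", "A/T", "C/G", "G/T"]:
        first, second = pair.split("/")
        print(pair, classify(first, second))
```

Comparing the percentages from question 5 for the two groups may help with the reasoning asked for in question 6.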

  • Now analyze the entire database! Just kidding! But look at it, anyway.

    1. Can you think of a computer program that would have made the previous tasks easier? What features would this program need? How about a program that would allow you to answer the following questions and still be home in time for dinner?
    2. One possible computer program can be viewed here. Please have a look at it and try to recognize some of the processes that you utilized during your manual analysis.
    3. Applying the program, how many different polymorphisms can you identify in the entire database?
    4. List all possible different allele combinations.
    5. Determine the ratios for the occurrence of different polymorphisms in the entire database.
    6. Compare and contrast your results with your previous findings. Why does the database contain a different number of SNPs than you had estimated previously (question 1 above)?
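For comparison with the program linked in this exercise, here is a minimal sketch of how such a tally could be written. It is not that program. It assumes, purely for illustration, that each database line is tab-separated and that one column holds the two alleles in the form `A/G`; the real TSC file layout may differ, so `allele_column` and `sep` have to be adapted. The identifiers in the example lines are invented placeholders.

```python
from collections import Counter


def tally_polymorphisms(lines, allele_column=1, sep="\t"):
    """Count each allele combination (e.g. 'A/G') in an iterable of lines.

    The column layout is an assumption. 'G/A' and 'A/G' are counted together.
    Lines without a two-allele entry in the chosen column are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        if len(fields) <= allele_column:
            continue
        alleles = fields[allele_column].strip().upper().split("/")
        if len(alleles) != 2:
            continue
        counts["/".join(sorted(alleles))] += 1
    return counts


def percentages(counts):
    """Convert raw counts into percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {combo: 100.0 * n / total for combo, n in counts.items()}


if __name__ == "__main__":
    # Invented placeholder lines, not real database entries.
    example = ["snp_a\tA/G", "snp_b\tG/A", "snp_c\tC/T", "snp_d\tA/C"]
    counts = tally_polymorphisms(example)
    for combo, pct in sorted(percentages(counts).items()):
        print(f"{combo}: {counts[combo]} ({pct:.1f}%)")
```

Because `tally_polymorphisms` accepts any iterable of lines, an open file object can be passed directly, so a large file is read line by line rather than loaded into memory at once.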


  • Some more TSC SNP (CSHL)
    • Go to SNP Consortium
    • Check out 'About'
    • Find SNPs in and/or adjacent to Human Cadherin Gene (Heart)
      • 'About'; 'Gene Search'; type in search string; 'Search'; check respective box; 'Dump table'; view results
    • Find genes around particular SNP
      • Choose a SNP ID; open Blast Genome Search; perform search
      • Record the Request ID so that you can return to the search later; 'Format'
      • In result window select 'Genome View'
      • Select record for 'Map element'
      • Make sense of result
    • Identify SNPs in your sequence
      • 'About'; 'Blast'; Paste in sequence
      • Select 'View features in this region'
      • How many SNPs were identified in your sequence?

  • SNPs in other databases



  • Examine the distribution of SNPs over human DNA contigs
    • Go to http://www.ensembl.org/
    • Explore the 'How do I ...?' section
    • Select a chromosome under 'Browse a Chromosome' (e.g. Girls go to Y and guys to X)
    • Examine the sequence visually and discuss whether SNPs appear to be randomly distributed over the genome or not
    • Run the cursor over the different feature bars and learn how to navigate the display by using the information provided in the pop-up windows, especially the function that lets you zoom in
    • Can you identify a correlation between the occurrence of SNPs and GC content or the prevalence of CpG islands?
    • Do SNPs appear to be more prevalent in genes or in intergenic regions?
    • Examine 10 intergenic and 10 intragenic SNPs in more detail. What alleles can you identify for each SNP? (Make sure that you record the exact location of each SNP, so that you can find it again after you have closed the image)
    • Select a region that you want to look into further and zoom into the image until you can identify the nucleotide sequence for this region
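Once you have copied a nucleotide sequence out of the browser, its GC content and CpG density can be estimated with a short script. The sketch below is an illustration rather than part of Ensembl; the function names are made up, and the observed/expected CpG ratio uses the commonly quoted formula (number of CG dinucleotides × sequence length) / (number of C × number of G). The example string is an arbitrary placeholder.

```python
def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq):
    """Observed/expected ratio of CG dinucleotides in seq.

    Returns 0.0 when the sequence contains no C or no G.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


if __name__ == "__main__":
    window = "ttacgcgcgatacgcggcatta"  # placeholder; paste your own region here
    print(f"GC content: {gc_content(window):.2f}")
    print(f"CpG obs/exp: {cpg_obs_exp(window):.2f}")
```

Running this on several windows, with and without SNPs, gives numbers to set beside the visual impression from the browser.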

  • Examine the occurrence of SNPs in different types of DNA sequences

    There are two basic types of SNPs: those within the coding regions of genes, which are called cSNPs, and those outside of genes. Previous studies have found that, on average, genes contain about four SNPs per gene. This observation, together with the estimate of a total of ca. 30,000 genes in the human genome, indicates the existence of roughly 120,000 cSNPs. Of these, about 60% are estimated to be synonymous, i.e. to not lead to changes in amino acids, and 40% are expected to change an amino acid. These roughly 50,000 non-synonymous cSNPs, together with an unknown number of regulatory and other non-coding but functional polymorphisms, comprise the bulk of common molecular variation with potential phenotypic consequences. In addition to these causal SNPs, linkage studies try to identify SNPs that are associated with the inheritance of metabolic phenotypes, yet are located outside of genes and are not directly involved in causing changes in phenotype.
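The estimate in the paragraph above can be retraced step by step. The short sketch below only repeats the arithmetic with the figures given in the text; the variable names are illustrative. Note that 40% of 120,000 is 48,000, which the text rounds to about 50,000.

```python
genes = 30_000             # estimated number of human genes (from the text)
snps_per_gene = 4          # average number of SNPs per gene (from the text)
changing_percent = 40      # share of cSNPs expected to change an amino acid

total_csnps = genes * snps_per_gene
amino_acid_changing = total_csnps * changing_percent // 100
silent = total_csnps - amino_acid_changing

if __name__ == "__main__":
    print("total cSNPs:", total_csnps)
    print("amino-acid-changing cSNPs:", amino_acid_changing)
    print("silent cSNPs:", silent)
```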

    SNPs appear to occur at different frequencies in different DNA sequences. For example, an analysis of a 1 Mb interleukin gene cluster on human chromosome 5 (5q31) in 40 Northern European individuals yielded the following results (Banerjee et al., 2001, CSHL Genome Meeting):

    Region                            Bp sequenced   # SNPs   Frequency (SNPs/bp)
    Coding regions/exons                      1977        4                 1/494
    Introns                                  14700       40                 1/367
    UTRs                                      1105        4                 1/276
    Conserved non-coding sequences           11295       28                 1/403
    Intergenic                               13998       36                 1/389
    Total                                    31778       85                 1/374
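The frequency column is simply the number of base pairs sequenced divided by the number of SNPs found. The sketch below recomputes it from the table above so that the rows can be compared directly; the names are illustrative and the figures are copied from the table. It also adds up the five category rows, which come to 43,075 bp and 112 SNPs rather than the 31,778 bp and 85 SNPs of the Total row. One possible reading, which is an assumption and not stated in the source, is that the conserved non-coding sequences overlap with other categories.

```python
# (bp sequenced, SNPs found, reported "1/N" denominator) for each table row.
table = {
    "Coding regions/exons": (1977, 4, 494),
    "Introns": (14700, 40, 367),
    "UTRs": (1105, 4, 276),
    "Conserved non-coding sequences": (11295, 28, 403),
    "Intergenic": (13998, 36, 389),
    "Total": (31778, 85, 374),
}


def bp_per_snp(bp, snps):
    """Average number of base pairs per SNP."""
    return bp / snps


# Sum the category rows (everything except "Total") for comparison.
category_rows = [row for name, row in table.items() if name != "Total"]
summed_bp = sum(bp for bp, _, _ in category_rows)
summed_snps = sum(snps for _, snps, _ in category_rows)

if __name__ == "__main__":
    for name, (bp, snps, reported) in table.items():
        print(f"{name:32s} 1 SNP per {bp_per_snp(bp, snps):6.1f} bp "
              f"(reported 1/{reported})")
    print("sum of category rows:", summed_bp, "bp,", summed_snps, "SNPs")
```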

    • Examine the occurrence of SNPs in different types of DNA sequences. How would you explain the differences in frequencies? Why do the authors report a higher SNP frequency than the average of one SNP per 1,300 nucleotides stated above for any two human alleles?

  • Follow this link to Chapter 2: SNPs in Biomedicine