Biological Sequence Analysis

Tim Beißbarth, Stefan Haas
tim.beissbarth@molgen.mpg.de, stefan.haas@molgen.mpg.de

Database searching with raw sequences

  1. Obtain the sequence with the accession number AW951311 by SRS
    (http://genius.embnet.dkfz-heidelberg.de/menu/srs/)
  2. Perform a BLAST search at the NCBI (http://www.ncbi.nlm.nih.gov/BLAST/)
    1. against the EST database
    2. against the vector database
  3. Compare the results

Prediction of a human gene from genomic sequence and ESTs

Sequence retrieval

Obtain the GenBank entry of the human chromosome 16 BAC clone by SRS
http://genius.embnet.dkfz-heidelberg.de/menu/srs/
Accession number: AF001549

The complete sequence consists of 202004 bp. Copy the bases from positions 35101 to 80100 and save them as a file.
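The positions above are 1-based and inclusive, so the region from 35101 to 80100 contains 45000 bases. If you prefer to cut the region out with a script rather than by hand, a minimal Python sketch could look like the following. The function names and the FASTA line width are assumptions chosen for illustration, and the sequence is assumed to be available already as a plain string of bases.

```python
def extract_region(sequence, start, end):
    """Return the bases from start to end (1-based, inclusive)."""
    if not (1 <= start <= end <= len(sequence)):
        raise ValueError("coordinates out of range")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


def to_fasta(header, sequence, width=60):
    """Format a sequence as a FASTA record with fixed-width lines."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(rows) + "\n"
```

Calling extract_region(bac_sequence, 35101, 80100) on the full 202004 bp sequence returns the 45000 bp fragment, which to_fasta can then format for saving to a file.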

Screening for repeat sequences

Screen the genomic DNA for repetitive elements with RepeatMasker at http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker or http://woody.embl-heidelberg.de/repeatmask: upload the file saved above and set the return format to HTML in the submission form.

Inspect the masked sequence.
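To get a quick overview of how much of the sequence was masked, the masked sequence can also be summarized with a short script. The sketch below assumes that masked bases are written as N or X and that the sequence is given as a plain string. It does not handle lowercase soft-masking, and any N already present in the original sequence would be counted as masked. The function names are illustrative only.

```python
def masked_fraction(sequence, mask_chars="NX"):
    """Return the fraction of positions that carry a masking character."""
    if not sequence:
        return 0.0
    masked = sum(1 for base in sequence if base in mask_chars)
    return masked / len(sequence)


def masked_intervals(sequence, mask_chars="NX"):
    """Return runs of masked bases as (start, end), 1-based and inclusive."""
    intervals = []
    run_start = None
    for pos, base in enumerate(sequence, start=1):
        if base in mask_chars:
            if run_start is None:
                run_start = pos          # a new masked run begins here
        elif run_start is not None:
            intervals.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:            # sequence ends inside a masked run
        intervals.append((run_start, len(sequence)))
    return intervals
```

The interval list can be compared with the repeat annotation reported by the masking server.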

EST-searching

  1. Perform a BLAST search against human ESTs at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/)
    1. with the unmasked part of the BAC clone
    2. with the repeat-masked BAC clone
    Discuss the differences between the hits obtained.
  2. Perform a BLAST search with the repeat-masked BAC clone against the EST clusters (GeneNest: http://genenest.molgen.mpg.de)
  3. Inspect the EST clusters. Obtain one or more consensus sequences of the contigs (and save them to a file).

Gene prediction

Use gene prediction programs such as GENSCAN and Genie to make gene predictions on the selected (repeat-masked?) 45 kb of the BAC clone.
  1. Gene Prediction Programs
    1. GENSCAN (http://genome.dkfz-heidelberg.de/cgi-bin/GENSCAN/genscan.cgi)
    2. Genie (http://www.fruitfly.org/seq_tools/genie.html)
    3. FGenes (http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html)
  2. Align one or several of the consensus sequences from GeneNest against the human genomic sequence using the program SIM4 (http://pbil.univ-lyon1.fr/sim4.html). Enter the sequences in plain text format.
  3. Compare the resulting protein sequences by means of a dotplot
    (use Dotlet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html)
    1. make a dotplot to compare the consensus sequence(s) of the EST clusters (see above) and the predicted genes
    2. make a dotplot to compare the gene predictions
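To illustrate what a dotplot shows, here is a minimal Python sketch of the basic idea: a dot is placed wherever a window from one sequence matches a window from the other in enough positions. This is a simplified stand-in and not the method used by Dotlet, which scores windows with a substitution matrix. The function names and default parameters are assumptions, and the nested loops are only practical for short sequences such as proteins.

```python
def dotplot(seq_a, seq_b, window=3, min_matches=3):
    """Return (i, j) start positions (0-based) of windows that match.

    A window of length `window` starting at i in seq_a and at j in seq_b
    is reported when at least `min_matches` positions are identical.
    """
    hits = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(
                1 for k in range(window) if seq_a[i + k] == seq_b[j + k]
            )
            if matches >= min_matches:
                hits.append((i, j))
    return hits


def render(seq_a, seq_b, hits):
    """Draw the hits as text: rows follow seq_a, columns follow seq_b."""
    grid = [["." for _ in seq_b] for _ in seq_a]
    for i, j in hits:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)
```

Two identical sequences produce an unbroken main diagonal. Gaps in the diagonal or shifted diagonals point to exons that are missing or different in one of the predictions.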

Verification and functional assignment

  1. Verify the above gene predictions by
    1. performing a BLAST search against ESTs
    2. searching against EST clusters with GeneNest
  2. What is the function of the predicted gene? Perform a BLAST search against proteins.
  3. Use the Pfam Database to screen for protein domains. (http://www.sanger.ac.uk/Pfam/)

Other sources (optional)

  1. Find the genomic region in the assembly of the public human genome project (http://genome.ucsc.edu/). Use the accession number (AF001549) as a query.
  2. Find the genomic region in the Celera assembly (http://www.celera.com/) (online registration required).
  3. Find the gene in the Ensembl project (http://www.ensembl.org).
  4. Compare the gene prediction with the mRNA with accession number AJ272050.

Prediction of a mouse gene by homology

Sequence retrieval

  1. Retrieve the sequence BB019265
  2. Find the sequence in the GeneNest database, either by a BLAST search or by a query with the accession number.
  3. Screen the "nr" database at NCBI with
    1. the sequence of BB019265 (Tip: try also repeat masking)
    2. the consensus sequence from GeneNest
    3. compare the results.
  4. Screen the genomic "htgs" database at NCBI with the consensus sequence from GeneNest

Gene Prediction

  1. Run gene prediction on a genomic sequence found with the GeneNest consensus and compare the results with the sequences of the genes retrieved from the "nr" search.
  2. Cut out a region of 10 kb from the genomic sequence. Use the homologous rat gene for the 2-oxoglutarate carrier to run homology-based gene prediction using GeneWise (http://www.sanger.ac.uk/Software/Wise2/genewiseform.shtml).
  3. Compare the different gene predictions.

Alternative Splicing prediction based on EST data

For the following human EST sequences, find a homologous cluster in the GeneNest database. Look for alternative splicing in the alignments of the consensus sequences of a cluster with the genomic sequence at http://splicenest.molgen.mpg.de

Optional:

Analysis of tissue-specificity based on EST data

Analyze the following GeneNest EST clusters.
Which EST clusters may reflect tissue-specific genes/transcripts?
Which tissues are most important for each cluster?

Promoter recognition (optional)

Retrieve the sequence of the human c-myc oncogene
(accession number: D10493).
The promoter is located around position 2300.

  1. Use the TRANSFAC database (http://transfac.gbf.de/TRANSFAC/) to get the promoter description. Alternatively, SRS can be used. Use search to find the c-myc gene.
  2. Use the MatInspector program to search in a 1000 bp window around the promoter region for potential transcription factor binding sites.
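The coordinates of the 1000 bp window can be worked out with a small helper. The sketch below assumes one reasonable convention: the window starts half its size upstream of the given position and is clipped to the ends of the sequence. The function name is illustrative, and the sequence length used in the example is a placeholder, not the real length of D10493.

```python
def window_around(center, size, seq_length):
    """Return (start, end), 1-based and inclusive, of a window around center.

    The window starts size // 2 bases before `center` and is clipped to
    the range 1..seq_length, so it can be shorter than `size` near the ends.
    """
    start = max(1, center - size // 2)
    end = min(seq_length, start + size - 1)
    return start, end


# Placeholder length for illustration only.
start, end = window_around(2300, 1000, 10000)
```

With these inputs the window runs from position 1800 to 2799, which is the region to cut out and submit to the binding site search.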