Biological Sequence Analysis

Stefan Haas
stefan.haas@molgen.mpg.de

Database searching with raw sequences

  1. Obtain the sequences with the accession numbers AI557731 and BG570348 via SRS
    (http://genius.embnet.dkfz-heidelberg.de/menu/srs/)
  2. Perform a BLAST search at the NCBI (http://www.ncbi.nlm.nih.gov/BLAST/) against the nr database

  3. Compare the results. Can we assign the sequences to a specific gene?

Prediction of a human gene from genomic sequence and ESTs

Sequence retrieval

Obtain the GenBank entry of the human chromosome 16 BAC clone via SRS:
http://genius.embnet.dkfz-heidelberg.de/menu/srs/
Accession number: AF001549

The complete sequence consists of 202004 bp. Cut out the bases from positions 35101 to 80100 and save them to a file.
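The cutting step can also be done in a few lines of Python. The following is a minimal sketch, assuming the positions are 1-based and inclusive (as in GenBank numbering); the function names `extract_region` and `to_fasta` and the toy sequence are illustrative and not part of the exercise.

```python
def extract_region(sequence: str, start: int, end: int) -> str:
    """Return bases start..end of sequence (1-based, inclusive coordinates)."""
    if not 1 <= start <= end <= len(sequence):
        raise ValueError("coordinates lie outside the sequence")
    # Python slices are 0-based and end-exclusive, hence start - 1.
    return sequence[start - 1:end]


def to_fasta(header: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as a FASTA record with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Toy stand-in with the same length as the BAC clone (202004 bases);
# in practice the sequence would be read from the GenBank entry.
toy_sequence = "ACGT" * 50501
region = extract_region(toy_sequence, 35101, 80100)
print(len(region))  # 45000
```

Writing `to_fasta("AF001549_35101_80100", region)` to a file gives a record that can be uploaded in the following steps; the header text is an arbitrary choice.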

Screening for repeat sequences

Screen the genomic DNA for repetitive elements with RepeatMasker at http://repeatmasker.genome.washington.edu/cgi-bin/RepeatMasker or http://woody.embl-heidelberg.de/repeatmask: upload the file saved above and set the return format to html in the submission form.

Inspect the masked sequence.
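To support the inspection, the masked portion can be summarized programmatically. This is a rough sketch, assuming that masked bases appear as N or X or as lowercase letters, depending on the masking options chosen; the function names are hypothetical.

```python
def is_masked(base: str, mask_chars: str = "NX") -> bool:
    """Treat N/X and lowercase letters as masked (an assumption about the output)."""
    return base in mask_chars or base.islower()


def masked_fraction(sequence: str) -> float:
    """Fraction of bases in the sequence that are masked."""
    if not sequence:
        return 0.0
    return sum(1 for base in sequence if is_masked(base)) / len(sequence)


def masked_intervals(sequence: str) -> list:
    """Return 1-based, inclusive (start, end) intervals of consecutive masked bases."""
    intervals = []
    run_start = None
    for pos, base in enumerate(sequence, start=1):
        if is_masked(base):
            if run_start is None:
                run_start = pos
        elif run_start is not None:
            intervals.append((run_start, pos - 1))
            run_start = None
    if run_start is not None:
        intervals.append((run_start, len(sequence)))
    return intervals


example = "ACGTNNNNACGTacgtAC"
print(masked_intervals(example))  # [(5, 8), (13, 16)]
```

The intervals are relative to the excised region, so 35100 has to be added to map them back onto the coordinates of the full BAC sequence.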

EST-searching

Select a 40 kb genomic region from the Drosophila genome (chromosome 3R:25863269..25903268 at http://hdflyarray.zmbh.uni-heidelberg.de/cgi-bin/gbrowse).

  1. Perform a BLAST search at NCBI (http://www.ncbi.nlm.nih.gov/BLAST/)

    1. against ESTs
    2. against nr
  2. Perform a BLAST search against the EST clusters (GeneNest: http://genenest.molgen.mpg.de)

  3. Inspect the EST clusters. Obtain one or more consensus sequences of the contigs and save them to a file.
How do the results of the searches differ, and what are the advantages and disadvantages of each?

Gene prediction

Run gene predictions on the selected (repeatmasked?) 40 kb of genomic sequence.
  1. Gene Prediction Programs
    1. GENSCAN (http://genome.dkfz-heidelberg.de/cgi-bin/GENSCAN/genscan.cgi)

    2. Genie (http://www.fruitfly.org/seq_tools/genie.html)

    3. FGeneSH (http://www.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind)

  2. Align one or several of the consensus sequences from GeneNest against the human genomic sequence using the program SIM4 (http://pbil.univ-lyon1.fr/sim4.php); enter the sequences in plain text format.

  3. Compare the resulting protein sequences by means of a dotplot
    (use dotlet: http://www.isrec.isb-sib.ch/java/dotlet/Dotlet.html)
    1. Make a dotplot to compare the consensus sequence(s) of the EST clusters (see above) and the predicted genes

    2. Make a dotplot to compare the gene predictions
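The idea behind a dotplot can be sketched in a few lines. The version below marks a dot wherever two windows of equal length match exactly; this is a simplification chosen for illustration and is not Dotlet's own scoring. The function names and the example peptides are hypothetical.

```python
def dotplot(seq_a: str, seq_b: str, window: int = 3) -> set:
    """Return 0-based (i, j) pairs where seq_a[i:i+window] == seq_b[j:j+window]."""
    # Index every window of seq_b by its content for fast lookup.
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)
    hits = set()
    for i in range(len(seq_a) - window + 1):
        for j in index.get(seq_a[i:i + window], []):
            hits.add((i, j))
    return hits


def render(seq_a: str, seq_b: str, hits: set) -> str:
    """Draw the dotplot as text: rows follow seq_a, columns follow seq_b."""
    rows = []
    for i in range(len(seq_a)):
        rows.append("".join("*" if (i, j) in hits else "." for j in range(len(seq_b))))
    return "\n".join(rows)


# Two toy peptides that share the segment "KTAYI" at different offsets.
predicted = "MKTAYIAK"
consensus = "GGKTAYIW"
print(render(predicted, consensus, dotplot(predicted, consensus)))
```

A shared segment shows up as a diagonal run of dots; shifts between diagonals point to insertions or deletions, for example exons that one prediction contains and the other lacks.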

Verification and functional assignment

  1. Verify the above gene predictions by
    1. performing a BLAST search against ESTs

    2. searching against EST clusters with GeneNest

  2. What is the function of the predicted gene? Perform a BLAST search against proteins.

  3. Use InterPro (http://www.ebi.ac.uk/interpro/) or the Pfam database (http://www.sanger.ac.uk/Pfam/) to screen for protein domains.

Upstream regulating sequences

  • Check upstream regulating sequences at the CORG web site (e.g. genes BHMT, HNF4, DYPS)
  • Use the Transfac database to search for potential transcription factor binding sites (online registration required).

Which strategies are used?

Alternative splicing prediction based on EST data

For the following human EST sequences, find a homologous cluster in the GeneNest database. Look for alternative splicing in the alignments of the consensus sequences of a cluster with the genomic sequence at SpliceNest (http://splicenest.molgen.mpg.de).

Optional:

Analysis of tissue-specificity based on EST data

Analyze the following GeneNest EST clusters.

Which EST clusters may reflect tissue-specific genes/transcripts?

Which tissues are most important for each cluster?

Is this prediction consistent with alternative data sources, e.g. Gene Expression Atlas, Source, GeneCards?
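One simple way to approach these questions is to tabulate, for each cluster, the share of ESTs contributed by each tissue. The sketch below assumes that a tissue label is available for every EST in a cluster; the function names, the thresholds and the example data are hypothetical, and the calculation ignores differences in library size, which a careful analysis would need to correct for.

```python
from collections import Counter


def tissue_profile(est_tissues: list) -> list:
    """Return (tissue, fraction) pairs, largest fraction first."""
    counts = Counter(est_tissues)
    total = sum(counts.values())
    return sorted(
        ((tissue, n / total) for tissue, n in counts.items()),
        key=lambda item: (-item[1], item[0]),
    )


def looks_tissue_specific(est_tissues: list,
                          min_fraction: float = 0.8,
                          min_ests: int = 5) -> bool:
    """Flag a cluster whose ESTs come mostly from one tissue.

    Both thresholds are arbitrary illustration values, not recommendations.
    """
    if len(est_tissues) < min_ests:
        return False  # too few ESTs to say anything
    _, top_fraction = tissue_profile(est_tissues)[0]
    return top_fraction >= min_fraction


# Invented example: tissue labels of the ESTs in one cluster.
cluster = ["liver"] * 9 + ["brain"]
print(tissue_profile(cluster))
print(looks_tissue_specific(cluster))
```

Clusters with only a handful of ESTs are deliberately not flagged, since a single library can dominate them by chance.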


Tissue-specific alternative splicing

Run a query at http://splicenest.molgen.mpg.de/cgi-bin/ESTbase/query.cgi?Hs7 (preliminary interface) to find tissue-specific isoforms. Use the options "brain", "Skipped Exon" and "Best display for every cluster:".

Where are these alternative exons preferentially located?

Search for the gene LMO7. How reliable is the prediction?

Comparing tools

What are the differences between Ensembl, UniGene, GeneNest/SpliceNest, and the TIGR gene indices?