HMM for CpG Islands: HMM History and General Background


HMM for CpG Islands

  • HMM history

  • General background

  • Three Fundamental problems

    • Evaluation
    • Decoding
    • Training

HMM for CpG Islands

  • HMM Applications

    • Bioinformatics
    • Non-Bioinformatics
  • CpG Islands Problem

    • CpG Islands
    • Definition
    • Why interesting
    • Hidden Markov Model for CpG
    • What’s Hidden
  • Mathematica Implementation

    • Training
    • Decoding

Andrei Andreyevich Markov 1856-1922

  • Early 1900s

    • Markov conceives “Markov chains” including a proof of the Central Limit theorem for Markov Chains
    • Studies with Chebyshev and takes over his classes at Univ. of St. Petersburg
  • 1913

    • Russian government celebrates the 300th anniversary of the House of Romanov
    • AA Markov organizes a counter-celebration – the 200th anniversary of Bernoulli’s Law of Large Numbers


  • 1960s

    • Use of HMMs developed by a cold-war era research team in a classified program at the Communication Research Division of the Institute for Defense Analyses. (Oscar Rothaus).
  • 1970s

    • HMM work is de-classified and is soon being used in many peaceful applications.

Markov Chain

  • Example: if it was sunny yesterday, there is a 0.5 probability that it is sunny today and a 0.25 probability each that it is cloudy or rainy.
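The weather chain above can be sketched as a transition table. Only the "sunny" row comes from the slide; the other two rows are made-up placeholder values.

```python
states = ["sunny", "cloudy", "rainy"]

# transition[today][tomorrow] = P(tomorrow's weather | today's weather)
transition = {
    "sunny":  {"sunny": 0.5,  "cloudy": 0.25, "rainy": 0.25},  # from the slide
    "cloudy": {"sunny": 0.25, "cloudy": 0.5,  "rainy": 0.25},  # assumed
    "rainy":  {"sunny": 0.25, "cloudy": 0.25, "rainy": 0.5},   # assumed
}

def next_day_distribution(today):
    """Return the distribution over tomorrow's weather given today's state."""
    return transition[today]

# Each row must be a probability distribution over the next day's weather.
for row in transition.values():
    assert abs(sum(row.values()) - 1.0) < 1e-9
```

The defining property of a Markov chain is visible here: tomorrow's distribution depends only on today's state, not on the earlier history.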

HMM Definition

  • A Hidden Markov Model is a triplet (Π, A, B), whose dimensions are fixed by N and M:

    • Π Vector of initial state probabilities (length N)
    • A Matrix of state transition probabilities (N × N)
    • B Matrix of observation probabilities (N × M)
    • N Number of hidden states in the model
    • M Number of observation symbols
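As a concrete illustration of the (Π, A, B) triplet, here is a toy two-state, two-symbol model in Python; every state name and probability below is invented for the example.

```python
states  = ["S1", "S2"]            # N = 2 hidden states
symbols = ["a", "b"]              # M = 2 observation symbols

Pi = [0.6, 0.4]                   # Π: initial state probabilities (length N)
A  = [[0.7, 0.3],                 # A[i][j] = P(state j next | state i now), N x N
      [0.4, 0.6]]
B  = [[0.9, 0.1],                 # B[i][k] = P(symbol k | state i), N x M
      [0.2, 0.8]]

# Π and every row of A and B must each be a probability distribution.
for row in A + B + [Pi]:
    assert abs(sum(row) - 1.0) < 1e-9
```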

HMM – Three Problems

  • Evaluation

  • Decoding

  • Training

HMM – Evaluation Problem

  • Given a set of HMMs, which one is most likely to have produced the observation sequence?
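The evaluation question is usually answered with the forward algorithm: compute P(observations | model) for each candidate model and keep the best. A minimal Python sketch, with two made-up two-state models over the symbols 0 and 1:

```python
def forward_probability(pi, A, B, obs):
    """P(obs | model) via the forward algorithm.
    pi[i]: initial prob, A[i][j]: transition prob, B[i][k]: emission prob."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[k] * A[k][j] for k in range(n)) * B[j][o]
                 for j in range(n)]
    return sum(alpha)

# Model "x" tends to stay in a state and emit one symbol; "y" is uniform noise.
model_x = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.8, 0.2], [0.2, 0.8]])
model_y = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])

obs = [0, 0, 0, 1]
scores = {name: forward_probability(*model, obs)
          for name, model in [("x", model_x), ("y", model_y)]}
best = max(scores, key=scores.get)   # the model most likely to have produced obs
```

For this 0-heavy sequence the structured model "x" scores higher than the uniform model "y", so evaluation selects it.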

HMM for CpG Islands Hidden States

  • States: A+, C+, G+, T+, A-, C-, G-, T-

HMM - Overview Training Problem

HMM Applications: Bioinformatics

  • DNA sequence analysis

  • Protein family profiling

  • Prediction of protein folding

  • Prediction of genes

  • Horizontal gene transfer

  • Radiation hybrid mapping, linkage analysis

  • Prediction of DNA functional sites.

  • CpG island prediction

  • Splicing signals prediction

HMM Applications: Non-Bioinformatics

  • Speech recognition

  • Vehicle Trajectory Projection

  • Gesture Learning for Human-Robot Interface

  • Positron Emission Tomography (PET)

  • Optical Signal Detection

  • Digital Communications

  • Music Analysis

Some HMM-based Bioinformatics Resources

  • SAM

  • HMMpro

  • PFAM
CpG Islands Definition

  • “CpG” means “C precedes G” along one strand (the “p” stands for the phosphodiester bond between them)

  • Not C–G base pairs

Nucleotides – the 4 bases in DNA:

  • A (Adenine)
  • C (Cytosine)
  • G (Guanine)
  • T (Thymine)

CpG Islands

  • Away from gene regions:

    • The C in a CpG is typically methylated
    • Methylated C mutates to T relatively easily, so CpG dinucleotides become rare
  • Near promoter and coding regions:

    • Methylation is suppressed:
    • CGs remain CGs
    • Makes transcription easier!

CpG Islands – Why Interesting

  • CpG-rich regions are associated with genes that are frequently transcribed.

  • Locating CpG islands helps relate gene expression to position in the genome.

Q: Why an HMM?

  • It can answer the questions:

    • Short sequence: does it come from a CpG island or not?
    • Long sequence: where are the CpG islands?
  • So, what’s a good model?

HMM for CpG Islands Straight Markov Models

HMM for CpG Islands Combined Hidden Markov Model

HMM for CpG Islands What’s “hidden”?

HMM for CpG Islands The Three Problems

  • (Evaluation – not needed for the CpG islands problem)

  • Training

  • Decoding

HMM for CpG Islands Training Problem
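When the training data are labeled (each position already marked + for island, - for background), training reduces to maximum-likelihood counting of state-to-state transitions, as described in Durbin et al., chapter 3. A small Python sketch with a made-up labeled sequence:

```python
from collections import Counter

def train_transitions(sequence, labels):
    """ML estimate of P(next state | current state) from a labeled DNA sequence.
    States are base+label pairs such as 'C+' or 'G-'."""
    states = [base + label for base, label in zip(sequence, labels)]
    # Count every adjacent state pair, then normalize by the source-state totals.
    pair_counts = Counter(zip(states, states[1:]))
    from_counts = Counter(s for s, _ in zip(states, states[1:]))
    return {(s, t): c / from_counts[s] for (s, t), c in pair_counts.items()}

# Toy labeled training sequence (invented): a short all-island stretch.
probs = train_transitions("CGCGA", "+++++")
# In this toy data C+ is always followed by G+, so probs[("C+", "G+")] is 1.0.
```

A real estimator would pool counts over many labeled sequences and add pseudocounts so that unseen transitions do not get probability zero.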

Viterbi Algorithm

  • Decoding: find the meaning of the observation sequence by recovering the underlying states.

  • Hidden states: A+, C+, G+, T+, A-, C-, G-, T-

  • Observation sequence: CGCGA

  • Candidate state sequences: C+,G+,C+,G+,A+ or C-,G-,C-,G-,A- or C+,G-,C+,G-,A+ (among others)

  • Most probable path: C+,G+,C+,G+,A+

Viterbi Algorithm

  • Hidden Markov model: states S, transition probabilities a_kl, emission probabilities e_l(x).

  • Observed symbol sequence E = x1, …, xn.

  • Find: the most probable path of states that produced the symbol sequence E.

  • Let v_k(i) be the probability of the most probable path for the prefix x1, x2, …, xi that ends in state k. Then:

  • v_l(i+1) = e_l(x_{i+1}) · max over k of ( v_k(i) · a_kl )


  • Summary

  • Same dynamic-programming structure and cost as the forward algorithm, with max in place of sum.

  • The largest partial probability at the final position is the probability of the most probable path.

  • The best path is decided from the whole sequence, not from any single observation.
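The recurrence and traceback above can be sketched in Python (the deck's own implementation is in Mathematica). To keep the example short it uses a two-state island/background model rather than the full eight states, and all probabilities are illustrative, not trained values.

```python
import math

def viterbi(pi, A, B, obs, states):
    """Most probable state path, using the recurrence
    v_l(i+1) = e_l(x_{i+1}) * max_k(v_k(i) * a_kl), computed in log space
    to avoid underflow on long sequences."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")
    n = len(states)
    v = [lg(pi[k]) + lg(B[k][obs[0]]) for k in range(n)]
    back = []                                   # back-pointers for the traceback
    for x in obs[1:]:
        ptrs, nv = [], []
        for l in range(n):
            k_best = max(range(n), key=lambda k: v[k] + lg(A[k][l]))
            ptrs.append(k_best)
            nv.append(v[k_best] + lg(A[k_best][l]) + lg(B[l][x]))
        back.append(ptrs)
        v = nv
    # Trace back from the best final state.
    s = max(range(n), key=lambda k: v[k])
    path = [s]
    for ptrs in reversed(back):
        s = ptrs[s]
        path.append(s)
    return [states[i] for i in reversed(path)]

# Toy model: "+" = CpG island (C/G-rich emissions), "-" = background (uniform).
states = ["+", "-"]
pi = [0.5, 0.5]
A  = [[0.9, 0.1],                  # states are "sticky": switching is rare
      [0.1, 0.9]]
B  = [[0.1, 0.4, 0.4, 0.1],        # "+" emission probs for A, C, G, T
      [0.25, 0.25, 0.25, 0.25]]    # "-" emits all bases uniformly
obs = ["ACGT".index(c) for c in "CGCGA"]

decoded = viterbi(pi, A, B, obs, states)
```

With these numbers the C/G-rich sequence CGCGA decodes as island throughout: the switch penalty makes staying in "+" for the final A cheaper than hopping to "-".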

Now, on to our Mathematica implementation…


  • References…

  • R. Durbin, S. Eddy, A. Krogh, and G. Mitchison, “Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids,” Cambridge University Press, 1998, chapters 3 and 5.

  • A. Krogh, M. Brown, I. Saira Mian, K. Sjölander, and D. Haussler, “Hidden Markov Models in Computational Biology: Applications to Protein Modeling,” J. Mol. Biol. (1994) 235, 1501–1531.

  • L. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition,” Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989.

  • On-line tutorial:
