HMM history General background Three Fundamental problems - Evaluation
- Decoding
- Training
HMM for CpG Islands - Bioinformatics
- Non-Bioinformatics
CpG Islands Problem - CpG Islands
- Definition
- Why interesting
- Hidden Markov Model for CpG
- What’s Hidden
Mathematica Implementation
Andrei Andreyevich Markov 1856-1922
Early 1900s Early 1900s - Markov conceives “Markov chains” including a proof of the Central Limit theorem for Markov Chains
- Studies with Chebyshev and takes over his classes at Univ. of St. Petersburg
1913 - Russian government celebrates the 300th anniversary of the House of Romanov
- AA Markov organizes a counter-celebration – the 200th anniversary of Bernoulli’s Law of Large Numbers
1960s 1960s - Use of HMMs developed by a cold-war era research team in a classified program at the Communication Research Division of the Institute for Defense Analyses. (Oscar Rothaus).
1970s - HMM work is de-classified and is soon being used in many peaceful applications.
Markov Chain ==> 0.5 probability that it will be sunny today and 0.25 that it will be cloudy or rainy
HMM Definition Hidden Markov Model is a triplet (Π, A, B) - Π Vector of initial state probabilities
- A Matrix of state transition probabilities
- B Matrix of observation probabilities
- N Number of hidden states in the model
- M Number of observation symbols
Evaluation Decoding Training
Given a set of HMMs, which is the one most Given a set of HMMs, which is the one most likely to have produced the observation sequence?
States A+,C+,G+,T+,A-,C-,G-,T- States A+,C+,G+,T+,A-,C-,G-,T-
DNA Sequence analysis DNA Sequence analysis Protein family profiling Prediction of genes Horizontal gene transfer Prediction of DNA functional sites. CpG island prediction Splicing signals prediction
Speech Recognition Vehicle Trajectory Projection Gesture Learning for Human-Robot Interface Positron Emission Tomography (PET) Optical Signal Detection Music Analysis
Some HMM based Bioinformatics Resources PROBE www.ncbi.nlm.nih.gov/ BLOCKS www.blocks.fhcrc.org/ META-MEME www.cse.ucsd.edu/users/bgrundy/metameme.1.0.html SAM www.cse.ucsc.edu/research/compbio/sam.html HMMERS hmmer.wustl.edu/ HMMpro www.netid.com/ GENEWISE www.sanger.ac.uk/Software/Wise2/ PSI-BLAST www.ncbi.nlm.nih.gov/BLAST/newblast.html PFAM www.sanger.ac.uk/Pfam/
CpG ISLANDS CpG ISLANDS “CpG” means “C precedes G” Not CG base pairs
Nucleotides - 4 bases in DNA: Nucleotides - 4 bases in DNA: - A (Adenine)
- C (Cytosine)
- G (Guanine)
- T (Thymine)
Away from gene regions: Away from gene regions: Near promoter and coding regions: - Methylation is suppressed:
- CGs remain CGs
- Makes transcription easier!
CpG-rich regions are associated with genes which are frequently transcribed. CpG-rich regions are associated with genes which are frequently transcribed. Helps to understand gene expression related to location in genome.
Q: Why an HMM? Q: Why an HMM? It can answer the questions: - Short sequence: does it come from a CpG island or not?
- Long sequence: where are the CpG islands?
So, what’s a good model?
HMM for CpG Islands Straight Markov Models
HMM for CpG Islands Combined Hidden Markov Model
HMM for CpG Islands What’s “hidden”?
(Evaluation – not in CpG Islands) Training Decoding
HMM for CpG Islands Training Problem
Viterbi Algorithm Viterbi Algorithm Decoding- Meaning of observation sequence by looking at the underlying states. Hidden states A+,C+,G+,T+,A-,C-,G-,T- State sequences C+,G+,C+,G+,A+ or C-,G-,C-,G-,A- or C+,G-,C+,G-,A+ Most Probable Path C+,G+,C+,G+,A+
Viterbi Algorithm Viterbi Algorithm Hidden Markov model: S, akl, , el(x). Observed symbol sequence E = x1,….,xn. Let vk(i) be the partial probability of the most probable path of the symbol sequence x1, x2, ….., xi ending in state k. Then: v l(i + 1) = e l(xi+1) max(vk(i) akl)
Summary Summary Computationally less expensive than forward algorithm. Partial probability of reaching final state is the probability of the most probable path.
Now, on to our Mathematica Now, on to our Mathematica implementation…
References… References…
R.Dubin,S.Eddy, A.Krogh, and G. Mitchison. "Biologiclal Sequence Analysis: Probablistic models of Proteins and nucleic acids. Cambridge University Press, 1998. chapters 3 and 5. A.Krogh,M.Brown,I.Saira Mian,Kimmen Sjolander and David Haussler "Hidden Markov Models in Computational Biology Appications to Protein Modeling J.Mol Biol. (1994) 253, 1501-1531 L. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, Vol. 77, No. 2, Feb. 1989 On-line tutorial: - http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html
Do'stlaringiz bilan baham: |