Machine Learning
Wednesday, 28 May 2014
Transition Matrix
When Xt is discrete, the conditional distribution P(Xt | Xt-1) can be written as a K x K matrix A, where Aij = P(Xt = j | Xt-1 = i) is the probability of going from state i to state j. Each row of A sums to 1, so A is called a stochastic matrix.
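A minimal sketch of a stochastic matrix and of sampling a chain from it (the three states and all probabilities below are made up for illustration):

    import numpy as np

    # Rows = current state i, columns = next state j; each row sums to 1.
    A = np.array([[0.7, 0.2, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.5, 0.3]])
    assert np.allclose(A.sum(axis=1), 1.0)   # stochastic matrix check

    rng = np.random.default_rng(0)
    state, chain = 0, [0]
    for _ in range(10):
        # P(Xt = j | Xt-1 = state) is row `state` of A
        state = rng.choice(3, p=A[state])
        chain.append(state)
    print(chain)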
Tuesday, 27 May 2014
Directed Graphical Models (Bayes Nets)
Introduction
Suppose we observe multiple correlated variables (such as words in a document or genes in a microarray).
- How can we represent the joint distribution?
- How can we use the distribution to infer one set of variables given another in a reasonable amount of computation time?
- How can we learn the parameters of this distribution with a reasonable amount of data?
Chain Rule
By the chain rule of probability, any joint distribution over V variables can be written as
P(X1:V) = P(X1) P(X2 | X1) P(X3 | X1, X2) ... P(XV | X1:V-1)
The problem with this expression is that the conditional distributions become more and more complicated to represent as we condition on more variables. If we use conditional probability tables (CPTs), with K states per variable, representing each factor requires:
- O(K) parameters for P(X1)
- O(K^2) parameters for P(X2 | X1)
- ...
- O(K^V) parameters for P(XV | X1, X2, ..., XV-1) (see the counting sketch below)
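A rough counting sketch, assuming K discrete states per variable and a full table for each conditional (the values of K and V are arbitrary):

    # Entries needed to tabulate each factor of the chain rule.
    K, V = 2, 10
    for v in range(1, V + 1):
        parents = f"X1..X{v - 1}" if v > 1 else "nothing"
        print(f"P(X{v} | {parents}): about {K ** v} table entries")
    # The final factor alone needs on the order of K^V entries,
    # which quickly becomes infeasible as V grows.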
Conditional Independence
Markov Assumption: let us assume that "the future is independent of the past given the present", i.e.:
- Xt+1 is independent of X1:t-1 given Xt (see the sketch below for the factorization this gives us)
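A minimal numeric sketch of the resulting first-order factorization P(X1:T) = P(X1) P(X2 | X1) ... P(XT | XT-1), using a made-up three-state initial distribution and transition matrix:

    import numpy as np

    pi = np.array([0.5, 0.3, 0.2])      # hypothetical initial distribution P(X1)
    A = np.array([[0.7, 0.2, 0.1],      # hypothetical transition matrix, rows sum to 1
                  [0.3, 0.4, 0.3],
                  [0.2, 0.5, 0.3]])

    seq = [0, 0, 1, 2]                  # an example state sequence x1..x4

    # P(x1:T) = P(x1) * product over t of P(xt | xt-1)
    p = pi[seq[0]]
    for prev, cur in zip(seq[:-1], seq[1:]):
        p *= A[prev, cur]
    print(p)                            # 0.5 * 0.7 * 0.2 * 0.3 = 0.021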
Graphical Models
A graphical model is a way to represent a joint distribution by making conditional independence assumptions.
The nodes represent random variables, and the absence of edges encodes conditional independence assumptions.
Directed Graphical Models
Known as Bayesian networks, belief networks or causal networks.
Ordered Markov Property: the assumption that a node only depends on its immediate parents, and not on all of its predecessors in the ordering; that is, Xv is independent of its remaining predecessors given X_pa(v), where pa(v) denotes the parents of node v. Thus in general we will have:
P(X1:V | G) = ∏_v P(Xv | X_pa(v))
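A minimal sketch of this factorization on a made-up three-node binary network A -> B -> C (the structure and all CPT numbers are assumptions for illustration):

    # Each CPT maps a tuple of parent values to a distribution over the node's values.
    parents = {"A": [], "B": ["A"], "C": ["B"]}
    cpt = {
        "A": {(): [0.6, 0.4]},                       # P(A)
        "B": {(0,): [0.9, 0.1], (1,): [0.3, 0.7]},   # P(B | A)
        "C": {(0,): [0.8, 0.2], (1,): [0.5, 0.5]},   # P(C | B)
    }

    def joint(assignment):
        """P(X1:V | G) = product over nodes v of P(Xv | X_pa(v))."""
        p = 1.0
        for node, pa in parents.items():
            pa_vals = tuple(assignment[q] for q in pa)
            p *= cpt[node][pa_vals][assignment[node]]
        return p

    print(joint({"A": 1, "B": 1, "C": 0}))   # 0.4 * 0.7 * 0.5 = 0.14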
Markov and Hidden Markov Models
First-order Markov chain: the current state depends only on the immediately preceding state.
Second-order Markov chain: the current state depends on the two preceding states.
We could create higher-order Markov chains, but the number of parameters blows up. Instead, we assume that there is an underlying hidden process, modelled by a first-order Markov chain, and that the observations are noisy measurements of its state. This is called a Hidden Markov Model (HMM).
First-order Hidden Markov Model
- Zt is the hidden variable at "time" t. It represents a quantity of interest, such as the identity of the word that someone is currently speaking.
- Xt is the observed variable at "time" t. It is what we measure, such as the acoustic waveform.
- P(Zt | Zt-1) --> transition model
- P(Xt | Zt) --> observation model
- P(Zt | X1:t) --> state estimation (filtering); a small filtering sketch follows this list
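A minimal filtering sketch (the forward algorithm) for P(Zt | X1:t) on a made-up two-state HMM with three observation symbols; all parameter values here are assumptions for illustration:

    import numpy as np

    pi = np.array([0.6, 0.4])              # hypothetical initial distribution P(Z1)
    A = np.array([[0.8, 0.2],              # transition model P(Zt | Zt-1)
                  [0.3, 0.7]])
    B = np.array([[0.7, 0.2, 0.1],         # observation model P(Xt | Zt), 3 symbols
                  [0.1, 0.3, 0.6]])

    def filter_states(obs):
        """Return the belief state P(Zt | X1:t) for each t (normalized forward pass)."""
        alpha = pi * B[:, obs[0]]
        alpha /= alpha.sum()
        beliefs = [alpha]
        for x in obs[1:]:
            alpha = B[:, x] * (A.T @ alpha)   # predict with A, then weight by likelihood
            alpha /= alpha.sum()
            beliefs.append(alpha)
        return beliefs

    for t, b in enumerate(filter_states([0, 0, 2, 1]), start=1):
        print(f"P(Z{t} | X1:{t}) =", b)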
As an example from genetics (linkage analysis): for every individual i and location (locus) j along the genome, we create three nodes:
- The observed marker Xi,j: this can be a property such as blood type, or a fragment of DNA.
- Two hidden alleles, one inherited from the mother and one from the father.
Monday, 5 May 2014
Linear Regression
All credit goes to Kevin P. Murphy (Machine Learning: A Probabilistic Perspective).
Linear regression fit by least squares (equivalently, with a Gaussian noise model) is not robust to outliers, because squared residuals penalize large errors very heavily.
=> Use a noise distribution that assigns higher likelihood to outliers, so that the straight line does not have to be perturbed to "explain" them: the Laplace distribution. The robustness arises from using the absolute values of the residuals instead of their squares (see the sketch below).
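A minimal sketch contrasting the squared-error (Gaussian) fit with the absolute-error (Laplace) fit on synthetic data containing one large outlier; the data, the true line, and the use of a general-purpose optimizer are all assumptions for illustration:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 30)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)
    y[-1] += 30.0                                   # one large outlier

    X = np.column_stack([np.ones_like(x), x])       # design matrix [1, x]

    # Least squares (Gaussian likelihood): closed form, pulled toward the outlier.
    w_ls = np.linalg.lstsq(X, y, rcond=None)[0]

    # Least absolute deviations (Laplace likelihood): numerical optimization.
    w_lad = minimize(lambda w: np.abs(y - X @ w).sum(),
                     x0=w_ls, method="Nelder-Mead").x

    print("least squares  (intercept, slope):", w_ls)
    print("least abs. dev (intercept, slope):", w_lad)   # closer to (1, 2)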
The problem with maximum likelihood (ML) estimation is that it can overfit: it picks the parameter values that are best for modelling the training data, but if the data are noisy, such parameters often correspond to complex functions.
Adding a zero-mean Gaussian prior on the parameters of a model, to encourage them to be small, is called l2 regularization or weight decay. By penalizing the sum of the squared magnitudes of the weights, we encourage simpler functions. Increasing lambda results in smoother functions, and the resulting coefficients also become smaller (see the sketch below).
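A minimal sketch of the l2-regularized (MAP) solution for linear regression with polynomial features; the synthetic data, feature degree, and lambda values are assumptions for illustration:

    import numpy as np

    def ridge_fit(X, y, lam):
        """MAP / l2-regularized weights: w = (lam*I + X^T X)^-1 X^T y."""
        d = X.shape[1]
        return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 21)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
    X = np.vander(x, N=8, increasing=True)     # degree-7 polynomial features

    for lam in [0.0, 1e-3, 1.0]:
        w = ridge_fit(X, y, lam)
        print(f"lambda={lam:g}: max |coefficient| = {np.abs(w).max():.2f}")
    # Larger lambda -> smaller coefficients -> smoother fitted function.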
Lambda controls how strongly the model is constrained.
As we increase lambda:
- the error on the training set increases
- the test error follows the characteristic U-shaped curve: the model overfits when lambda is too small and underfits when lambda is too large
We can use cross validation to pick lambda.
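A minimal K-fold cross-validation sketch for picking lambda, reusing the closed-form ridge solution from the sketch above; the synthetic data and the grid of candidate lambdas are assumptions:

    import numpy as np

    def ridge_fit(X, y, lam):
        d = X.shape[1]
        return np.linalg.solve(lam * np.eye(d) + X.T @ X, X.T @ y)

    rng = np.random.default_rng(2)
    x = rng.uniform(0, 1, 60)
    y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)
    X = np.vander(x, N=8, increasing=True)

    idx = rng.permutation(len(y))
    folds = np.array_split(idx, 5)             # 5-fold split

    def cv_error(lam):
        errs = []
        for fold in folds:
            train = np.setdiff1d(idx, fold)
            w = ridge_fit(X[train], y[train], lam)
            errs.append(np.mean((y[fold] - X[fold] @ w) ** 2))
        return np.mean(errs)

    lambdas = [1e-6, 1e-4, 1e-2, 1.0, 10.0]
    print("CV-selected lambda:", min(lambdas, key=cv_error))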
Regularization is the most common way to avoid overfitting. Another way is to use lots of data. The more training data we have, the better we can learn.