Sunday 23 August 2015

Project Completed .. :)

I am basically done with my project of building the modules for Dynamic Bayesian Networks, along with the ability to do inference over them. I am currently waiting for my PR #465 to be merged.

Google Summer of Code (GSoC) as an experience was challenging and at the same time very conducive to learning, as I learned things that I may not have learned outside of the program. Sometimes the difficulties felt like too much of a burden, and overcoming and solving them required a certain amount of effort, endurance and patience. This is what GSoC has really taught me.

I am also planning to add an approximate inference module for Dynamic Bayesian Networks after my GSoC, since we now have a stable implementation of likelihood weighting, forward sampling and rejection sampling (thanks to Pratyaksh).

Wednesday 22 July 2015

MAP inference in Dynamic Bayesian Networks

I am about to finish the junction tree algorithm for inference in Dynamic Bayesian Networks, and I am about to start with the MAP queries for inference.

Currently, the MAP queries find the maximum value using the maximize operation. But for a Dynamic Bayesian Network we need to compute the path that has the maximum probability, also called the Viterbi path.

The Viterbi algorithm uses the famous dynamic programming paradigm, although it could be quite complicated to implement for a Dynamic Bayesian Network.
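To make the difference concrete, here is a small sketch (my own toy example in numpy, not pgmpy code): the maximize operation simply max-marginalizes a factor over a variable, while computing a Viterbi path additionally requires remembering which assignment achieved each maximum so that the best path can be traced back.

    import numpy as np

    # A toy factor phi(X, Y) over two binary variables: rows index X, columns index Y.
    phi = np.array([[0.10, 0.40],
                    [0.30, 0.20]])

    # The maximize operation: eliminate Y by taking the maximum over its values.
    max_marginal = phi.max(axis=1)     # array([0.4, 0.3])

    # For a Viterbi-style path we also need the argmax, i.e. which value of Y
    # achieved the maximum for each X, so that the best path can be recovered later.
    best_y_for_x = phi.argmax(axis=1)  # array([1, 0])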

Also, inference in the junction tree will further reduce the scale of the operations, so that variable elimination does not slow the algorithm down.
I hope that the inference works well now.

Tuesday 30 June 2015

Inference in Dynamic Bayesian Networks (continued)

For the past two weeks I have spent time understanding the inference algorithm and implementing it. Today I will be talking about the junction tree algorithm for inference in Dynamic Bayesian Networks.

The algorithm proceeds in the following steps:
1) Initialization :- This requires constructing the two initial junction trees J1 and Jt.
  1. J1 is the junction tree created from timeslice 1 of the 2-TBN (2-timeslice Bayesian network), and Jt is the junction tree created from timeslice 2 of the 2-TBN. The time counter is initialized to 0. Also, let the interface nodes (denoted by I1 and I2 for timeslices 1 and 2 respectively) be those nodes whose children are in the next timeslice.
  2. If the queries are performed on the initial timeslice, then the results can be output by the standard VariableElimination procedure, using timeslice 1 of the Bayesian network as the model for inference.
  3. For evidence, if the current time in the evidence is 0, then the evidence should be applied to the initial static Bayesian network. Otherwise, it has to be applied to the second timeslice of the 2-TBN.
  4. For creating the junction tree J1, the procedure is as follows (see the sketch at the end of this post):-
    1. Moralize the initial static Bayesian network.
    2. Add edges among the interface nodes so as to make I1 a clique.
    3. The rest of the procedure is the same as for a static network; the above step is the only difference.
  5. For the junction tree Jt, a similar procedure is followed, where a clique is formed for I2 as well.
2) Inference procedure:- In this procedure, the clique potential over the interface nodes is passed on to the interface clique of the next timeslice (similar to the message passing algorithm).
The time counter is incremented accordingly.
So basically the junction tree Jt acts as some sort of engine, where the in-clique is where the values are supplied and the out-clique is where the values are obtained, given the evidence.
The query variables are extracted as usual at each step, and the evidence is applied as well.
The best part about this procedure is that it eliminates entanglement, and only the out-clique potential is required for inference.
The implementation is still in progress.
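Here is the sketch referred to in step 4 above. It is only my own illustration of the idea (plain networkx, not the pgmpy implementation): moralize the slice, then connect the interface nodes so that I1 (or I2 for Jt) forms a clique before triangulation and junction tree construction.

    from itertools import combinations
    import networkx as nx

    def moralize(dag):
        # Moral graph of a DAG: drop edge directions and "marry" the parents of each node.
        moral = nx.Graph(dag.to_undirected())
        for node in dag.nodes():
            for u, v in combinations(list(dag.predecessors(node)), 2):
                moral.add_edge(u, v)
        return moral

    def moralize_with_interface_clique(dag, interface_nodes):
        # Same moralization, plus the extra step that distinguishes J1/Jt:
        # add edges among the interface nodes so that they form a clique.
        moral = moralize(dag)
        for u, v in combinations(interface_nodes, 2):
            moral.add_edge(u, v)
        return moral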

Wednesday 17 June 2015

Inference in Dynamic Bayesian Networks

Today, I will be talking about how the inference works in Dynamic Bayesian Networks.
We could apply the following methods:
1) Naive Method:- One way is to unroll the Bayesian network as far as we'd like and then apply the inference methods that we applied to the standard Bayesian network. However, this method leads to exponentially large graphs, thereby increasing the time that inference takes.
(It was a surprise to me that we could do so, but it leads to a lot of problems.)
2) Forward and Backward Algorithm:- We could apply this algorithm, but it applies exclusively to hidden Markov models (HMMs). It involves converting the nodes into a joint state space, greatly increasing the size and again leading to huge complexity.
To reduce the complexity of inference, the methods are as follows:-
 We could compute the belief state by recursive estimation.
Let's assume that the belief state at time $t$ is $\alpha_t(X_t) = P(X_t | y_{1:t})$.
Then we could propagate the state forward by the following recursion:
                   $\alpha_t(X_t) \propto P(y_t | X_t) \sum_{X_{t-1}} P(X_t | X_{t-1}) \, \alpha_{t-1}(X_{t-1})$
So these terms are multiplied at every step of the recursive procedure, which gives the filtering algorithm.
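As a concrete illustration, here is a small sketch of that recursion for a two-state model. The transition matrix, emission matrix and prior below are toy values of my own, just to show the shape of the computation.

    import numpy as np

    A = np.array([[0.7, 0.3],      # A[i, j] = P(X_t = j | X_{t-1} = i)
                  [0.4, 0.6]])
    B = np.array([[0.9, 0.1],      # B[i, k] = P(Y_t = k | X_t = i)
                  [0.2, 0.8]])
    prior = np.array([0.5, 0.5])   # P(X_1)

    def filter_step(alpha_prev, obs):
        # One step of the recursion: predict with the transition model,
        # weight by the likelihood of the new observation, then renormalise.
        alpha = B[:, obs] * A.T.dot(alpha_prev)
        return alpha / alpha.sum()

    # Belief state after the first observation y_1 = 0.
    alpha = prior * B[:, 0]
    alpha /= alpha.sum()
    # Fold in the remaining observations one at a time.
    for obs in [1, 1, 0]:
        alpha = filter_step(alpha, obs)
    print(alpha)   # P(X_4 | y_{1:4})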
Other algorithms are the Frontier Algorithm and the Interface Algorithm, which are more popular for inference on a larger scale.
Apart from that, there are certain expectation maximization algorithms which may help in computing the most probable path.
This is what I am planning to implement in a couple of weeks.


Wednesday 3 June 2015

Hidden Markov Models

In the previous week, I described what a Dynamic Bayesian Network is. I have also written the basic framework that the Dynamic Bayesian Network encompasses. Currently, I am planning the hidden Markov model framework. The functionality of the hidden Markov model will be similar to that of MATLAB's.
1st week report
I have finished coding the Dynamic Bayesian Network, which has methods for unrolling and removing time slices.

What are hidden Markov models (HMMs)?
They are basically Dynamic Bayesian Networks on a smaller scale, which makes them less complex and more efficient. The example above consists of nodes Q1 and Q2, which are known as hidden nodes, while the nodes Y1 and Y2 are known as observed nodes.
The hidden states are represented by discrete random variables. Again, this is a 2-TBN, similar to the Dynamic Bayesian Networks.

The probabilities of the nodes in a time slice are given by the formula
                   $P(Z_t | Z_{t-1}) = \prod_{i = 1}^{N} P(Z_t^{i} \mid Pa(Z_t^{i}))$
where $Z_t^{i}$ is the $i^{\text{th}}$ node in time slice $t$ (with $N = N_h + N_o$) and $Pa(Z_t^{i})$ are the parents of the node $Z_t^{i}$, which may be in the same time slice or the previous one.
The above figure shows a 2-TBN. The probabilities can be given as follows:-
                  $P(Q_{1:2}, Y_{1:2}) = P(Q_1) \times P(Y_1|Q_1) \times P(Q_2|Q_1) \times P(Y_2|Q_2)$
The general equation becomes:
                  $P(Q_{1:T}, Y_{1:T}) =  P(Q_1)P(Y_1|Q_1) \times  \prod_{t= 2}^{T}P(Q_t|Q_{t-1})P(Y_t|Q_t)$
There are more sophisticated models such as coupled HMMs,
which look like this:
Now, in the above chains there are several hidden variables interacting with each other, along with the observed variables $Y_1, Y_2$ and $Y_3$.
The chains interact directly with each other, affecting the adjacent chains. So, there is a possibility of $X_2'$ interacting with the first chain if the model is unrolled further.
Coupled HMMs are used, for example, in detecting temperatures for fire alarms, and help in explicitly encoding the relation between the chains.
Another thing that HMMs help in finding is the most probable sequence of hidden states over a sequence of time slices. The most probable states are found by the popular Viterbi algorithm.
The states are found as follows:-
\begin{array}{rcl}
V_{1,k} &=& \mathrm{P}\big( y_1 \ | \ k \big) \cdot \pi_k \\
V_{t,k} &=& \max_{x \in S} \left(  \mathrm{P}\big( y_t \ | \ k \big) \cdot a_{x,k} \cdot V_{t-1,x}\right)
\end{array}
Basically, at every step the most probable previous state is chosen, and the probability of the resulting path is computed using the chain rule.
The algorithm is remarkably similar to the belief propagation used in Bayesian models.
These algorithms make hidden Markov models remarkably significant, with serious applications in speech recognition and the decoding of convolutional codes.
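The recursion above translates fairly directly into code. Here is a small sketch of my own (plain numpy, not the pgmpy implementation), where a[x, k] is the transition probability from state x to state k, emit[k, o] is $P(o \mid k)$ and pi is the initial distribution:

    import numpy as np

    def viterbi(y, a, emit, pi):
        # y: sequence of observation indices; a[x, k]: transition x -> k;
        # emit[k, o]: P(o | k); pi: initial distribution over hidden states.
        T, S = len(y), len(pi)
        V = np.zeros((T, S))
        back = np.zeros((T, S), dtype=int)
        V[0] = emit[:, y[0]] * pi                             # V_{1,k} = P(y_1 | k) * pi_k
        for t in range(1, T):
            for k in range(S):
                scores = emit[k, y[t]] * a[:, k] * V[t - 1]   # maximise over previous states x
                back[t, k] = np.argmax(scores)
                V[t, k] = scores[back[t, k]]
        # Trace the most probable state sequence backwards through the stored argmaxes.
        path = [int(np.argmax(V[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return list(reversed(path)), V[-1].max()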
This is what I am planning to implement within the next 2 weeks. Then I will start with inference in dynamic bayesian networks.


Wednesday 13 May 2015

Dynamic Bayesian Networks in pgmpy


I am really excited about my selection for GSoC '15. I hope that I will live up to the expectations of my mentors and be able to complete my project in the allotted duration.
My project is about dynamic Bayesian networks, which are a time-variant extension of static Bayesian networks and have a wide range of applications in protein sequencing and voice recognition systems.
So what is a Bayesian network?
A Bayesian network is a directed acyclic graph (DAG) that is an efficient and compact representation of a set of conditional independence assumptions about a distribution. Bayesian networks are an elegant framework for learning models from data that can be combined with prior expert knowledge.
Here is what a static Bayesian Network looks like,
(The conventional student example presented in the book Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman is as follows.)

The above directed graph represents the random variables as nodes; the edges represent the direct influence of one variable on another.
Here we are trying to monitor the grade that the student is going to get, which is conditionally dependent on the difficulty of the subject (and on the student's intelligence). The recommendation letter given to the student is in turn stochastically dependent on the grade assigned by the professor.
Here I have assumed that the grade assigned by the professor is ternary valued and the rest of the variables are binary valued.
$Difficulty :- Domain = Val(D) = \{d^0(easy),d^1(hard)\}$
$Grade:- Domain = Val(G) = \{g^1,g^2,g^3\}$
$Letter:- Domain = Val(L) = \{l^0(strong),l^1(weak)\}$
$Intelligence:- Domain = Val(I) = \{i^0,i^1\}$
$SAT:- Domain = Val(S) = \{s^0, s^1\}$

In general, each random variable is associated with a Conditional Probability Distribution, also called a CPD, that specifies the distribution over the values of the random variable given the values of its parents. The CPDs encode the distributions of the variables and help in precisely determining the behaviour of each variable. Here is what a CPD-encoded Bayesian network would look like:-


One such model, $P(I)$, represents the distribution of intelligent versus less intelligent students. Another model, $P(D)$, represents the distribution that distinguishes difficult classes from easier ones. (Let's call the above Bayesian network $B_{student}$ for future reference.)
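As a quick illustration, here is a rough sketch of how the structure of $B_{student}$ and these two CPDs could be written down in pgmpy, using the numbers that appear in the worked example below. The class names and the layout of the values follow pgmpy's API as I understand it and may differ between releases, so treat this as a sketch rather than exact code.

    from pgmpy.models import BayesianModel
    from pgmpy.factors import TabularCPD

    # Edges: Difficulty -> Grade, Intelligence -> Grade, Intelligence -> SAT, Grade -> Letter.
    student = BayesianModel([('D', 'G'), ('I', 'G'), ('I', 'S'), ('G', 'L')])

    # The two marginal CPDs; values has a single column here since D and I have no parents.
    cpd_d = TabularCPD('D', 2, [[0.6], [0.4]])   # P(d^0) = 0.6, P(d^1) = 0.4
    cpd_i = TabularCPD('I', 2, [[0.7], [0.3]])   # P(i^0) = 0.7, P(i^1) = 0.3
    student.add_cpds(cpd_d, cpd_i)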
Let's consider a particular case for this example:
$P(i^0,d^1,g^2,s^1,l^0)$
The probability of this event can be computed from the probabilities of the events comprising it using the chain rule:
$P(I,D,G,S,L) = P(D)P(I)P(G|I,D)P(S|I)P(L|G)$
Applying this formula, the probability of this event is given by
$P(i^0,d^1,g^2,s^1,l^0) = P(d^1)P(i^0)P(g^2|i^0, d^1)P(s^1|i^0)P(l^0|g^2)$
                                      $ = 0.4*0.7*0.25*0.05*0.4 = 0.0014 $
Thus using the chain rule we can compute the probability of any given state.
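For completeness, here is the same calculation written out in plain Python, using only the CPD entries quoted above:

    # Chain-rule computation of P(i0, d1, g2, s1, l0) for the student network.
    p_d1 = 0.4               # P(d^1)
    p_i0 = 0.7               # P(i^0)
    p_g2_given_i0_d1 = 0.25  # P(g^2 | i^0, d^1)
    p_s1_given_i0 = 0.05     # P(s^1 | i^0)
    p_l0_given_g2 = 0.4      # P(l^0 | g^2)

    p_event = p_d1 * p_i0 * p_g2_given_i0_d1 * p_s1_given_i0 * p_l0_given_g2
    print(p_event)           # 0.0014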

Now, let's come to Dynamic Bayesian Networks

Dynamic Bayesian Networks (DBNs) are static Bayesian networks that are repeated over a time series. In a Dynamic Bayesian Network, each time slice is conditionally dependent on the previous one. Suppose that $B_{student}$ is duplicated over a time series to form
a two-time-slice Bayesian network (2-TBN). (The variables are shown in single-letter notation for a compact representation; the subscript denotes the time slice that the variable belongs to.) It will look as follows:



Assuming that the output $l$ is the observed output (or the evidence) over the course of time, there are two sets of edges in the above Bayesian network. The first set are the intra-slice edges, which represent the influence between the random variables inside the original static Bayesian network. Thus, the topology within each slice is the same as that of $B_{student}$.
The second set are the inter-slice edges, which represent the conditional influence between random variables in two adjacent slices (the underlying assumption here is that this is a 2-TBN, to keep the complexity low; if this were an n-TBN, the inter-slice edges could branch out from the first slice to any of the n time slices).
The probabilities in the original distribution now determine the probabilities in the successive time slices. The conditional influence between $D^0$ and $G^0$ will remain the same as that between $D^1$ and $G^1$; thus the CPDs along the intra-slice edges remain the same.
However, CPDs will also be required along the inter-slice edges, which provide information about how the random variables depend on the previous slice.
Now, if we were to determine $P(L_{1})$, it would depend not only on the random variables present in its own time slice but also on the previous slice's variables.
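To give a feel for how this 2-TBN might be written down in code, here is a rough sketch using the DynamicBayesianNetwork class I am building for this project. Nodes are given as (variable, time slice) pairs; the exact method names may still change, and the particular inter-slice edges below are only for illustration, so treat this as a sketch of the intended interface rather than final code.

    from pgmpy.models import DynamicBayesianNetwork as DBN

    dbn = DBN()
    # Intra-slice edges for time slice 0 (same topology as B_student).
    dbn.add_edges_from([(('D', 0), ('G', 0)),
                        (('I', 0), ('G', 0)),
                        (('I', 0), ('S', 0)),
                        (('G', 0), ('L', 0))])
    # Inter-slice edges: which variables carry over to the next slice is up to the
    # modeller; D and I are used here purely as an example.
    dbn.add_edges_from([(('D', 0), ('D', 1)),
                        (('I', 0), ('I', 1))])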
In the next blog post, I will describe how to compute the conditional probabilities in a DBN.

Here is another, more complicated example that demonstrates how tedious DBNs can look.
This example is the BATmobile network from "The BATmobile: Towards a Bayesian Automated Taxi" by Forbes et al.

Source:-http://bnt.googlecode.com/svn/trunk/docs/usage_dbn.html