James Murphy

Hadoop Part II: Particle Filters in Hadoop MapReduce

In this article, I’m going to implement a basic particle filter in Hadoop MapReduce.  This is really just a personal-interest project to get me started learning Hadoop, based on an algorithm that I am familiar with and that suits MapReduce (to some extent), but it might have wider applications in communities where heavyweight particle filters are in use (e.g. in data assimilation for climate and weather modelling).  If you are interested in more details, please get in touch.

Particle Filters

The particle filter (PF) is a Bayesian sequential Monte Carlo method that takes a stream of observations (one at each time period) and uses them to generate a weighted collection of samples (particles) that approximately represent the posterior state distribution of a state space model after seeing the current observation.  This approximate posterior is then used as the basis for the next step of the filter.

In the simplest ‘bootstrap’ version, the particles are updated according to the transition model of the state space model, i.e. by taking each particle x and putting it through the state transition function f(x) for the next period.  This gives the state space model’s prediction of the state at the next observation time, creating an approximate prior distribution for the state at that time as a collection of particles (step 1 in the algorithm below).  This is then updated into the posterior distribution by re-weighting the particles according to the next observation, using the rules of importance sampling (step 2).  In this basic bootstrap form of the PF, the predictive prior is used as the importance distribution, so we hope that it is a good approximation to the posterior (if not, the variance of the weights will be too high, which in practice means that almost all the weight will concentrate on the least-bad particle).  In cases where the observation is very informative and the state space model does not make good predictions, the predictive distribution might not resemble the posterior distribution very much, which will result in high weight variance.  Such cases require more sophisticated proposal strategies (such as using extended or unscented Kalman filters to make proposals adapted to the observation), or better models.  We don’t worry about those cases here, however.

The basic bootstrap particle filter algorithm (in pseudo-code) is therefore:
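In outline (this is the standard bootstrap filter, written with N particles X[i,t] carrying normalized weights W[i,t], matching the steps described above):

For each time step t = 1, 2, …:

1. (Propose)  For i = 1, …, N, sample X[i,t] from the transition model p(x_t | X[i,t-1], \alpha); at t = 1, sample from the prior on the initial state instead.
2. (Weight)  For i = 1, …, N, compute the unnormalized weight w[i,t] = p(y_t | X[i,t]), then normalize: W[i,t] = w[i,t] / \sum_j w[j,t].
(*) The weighted particles now form the posterior approximation.
3. (Resample)  Draw N particles with replacement from {X[i,t]} according to the weights W[i,t], and reset all weights to 1/N.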

The posterior approximation at the point (*) is given by p(x_t | y_{1:t},\alpha) = \sum_{i=1}^{N} W[i,t] \delta_{ \{ X[i,t] \} }, where N is the number of particles and \delta_{ \{ X[i,t] \} } is a Dirac point mass at X[i,t].

If you want to know more about particle filters, there are lots of good tutorials and review papers.  A good place to start is probably the 2007 review, “An overview of existing methods and recent advances in sequential Monte Carlo” by Cappé, Godsill and Moulines.  Or, of course, you can wait a few months and check out my forthcoming book…

MapReduce

MapReduce describes a two-step programming paradigm which can be used for processing data sets.  Given a dataset {a,b,c,d,…}, the “map” step applies a function f to each of the data set elements, giving a new dataset {f(a), f(b), f(c), f(d),…}.  Because the function is applied separately to each element, with no interaction between elements, this operation lends itself to parallel processing.  Furthermore, the data set can easily be broken up into subsets and the mapping done on each of these separately (useful when large datasets are processed on multiple computing nodes).

The “reduce” stage of a MapReduce program takes the processed data set {f(a), f(b), f(c), f(d),…} and applies some sort of aggregation operation to it, yielding some useful information.  For example, the reduce operation might be to take the average of the map function values by summing up over them and dividing by the number of elements in the dataset.  The reduce operation does require communication, because it requires the mapped data be gathered from each of the nodes on which it is processed.  However, the nature of this communication is simple and only occurs at a single well-defined point in the program execution.
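As a toy illustration of the shape of this computation (ordinary Java streams here, nothing to do with Hadoop), computing the mean of f(x) = x^2 over a small data set might look like:

import java.util.List;

public class MapReduceIdea {
    public static void main(String[] args) {
        List<Double> data = List.of(1.0, 2.0, 3.0, 4.0);

        // "Map": apply f(x) = x^2 independently to each element;
        // "reduce": aggregate the mapped values into a single mean.
        double mean = data.stream()
                          .mapToDouble(x -> x * x)
                          .average()
                          .orElse(Double.NaN);

        System.out.println(mean);  // prints 7.5
    }
}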

MapReduce might seem like a restrictive paradigm, and it is – in fact, that is the point.  It is designed to simplify the interaction model between computing nodes in multi-node data processing systems so that parallelism can be effectively exploited.  This simple model is what allows Hadoop to deal with distribution and scaling automatically behind the scenes, as well as allowing it to be robust to node failure and many other things.  On the other hand, it is often still possible to achieve the required results within these restrictions, especially if several map-reduce operations are chained one after another.  In the following sections, we will look at how the particle filter can be implemented as a MapReduce program, allowing it to be distributed using Hadoop.

MapReduce in Hadoop

MapReduce programs in Hadoop consist of three Java classes:  a mapper class, which extends the Mapper class and implements the map part of the program; a reducer class, which extends the Reducer class and implements the reduce part of the program; and a driver class, which tells the system how everything hangs together.

Hadoop also offers the possibility of local combiners, which are like reduce steps that only act on the data held on each local node.  These can be useful to reduce communication overhead because they can reduce the output from a node into a single value or small set of values.  For example, if the reducer was going to calculate the sum of all elements in the mapped data set, local combiners could calculate the sum for all mapped data elements on each node, allowing only the per-node sums to be transmitted to the reducer(s).
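The particle filter below does not use a combiner, but to illustrate the idea, a combiner that pre-sums per-key values on each node could be written as follows (a combiner is just a Reducer whose input and output types match the map output, registered with job.setCombinerClass(…); the class name SumCombiner is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumCombiner extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the locally-held values for this key, so only one number per key
        // per node needs to be sent over the network to the reducer.
        double sum = 0.0;
        for (DoubleWritable v : values) {
            sum += v.get();
        }
        context.write(key, new DoubleWritable(sum));
    }
}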

A MapReduce Particle Filter

The particle filter fits relatively well into the MapReduce paradigm, because work on each particle is independent during the proposal and weighting stages, which, if a complex proposal mechanism and system model are used, can make up the bulk of the computation.  These stages fit naturally into the “map” part of the program.

Normalization of the weights requires knowledge of the unnormalized weights from each particle, and so has to form part of a “reduce” step.  Here, we also include resampling as part of the reduce step, since it requires knowledge of the normalized weights.  However, resampling might be possible as part of a subsequent second map stage, as discussed at the end of this post.  For simplicity, we deal with it in the reducer.

The following three sections look at the mapper, reducer and driver in turn for a basic particle filter for the very simple linear Gaussian noisy lag 1 autoregression (AR(1)) system of the form:

x_t = \alpha x_{t-1} + \epsilon_t

y_t = x_t + \eta_t

with \epsilon_t \sim \mathcal{N}(0,\sigma_{s}^2) and \eta_t \sim \mathcal{N}(0,\sigma_{obs}^2), where \mathcal{N}(a,b) is a Gaussian distribution of mean a and variance b.  The problem that we are trying to solve is to determine the unknown (latent) state variables x_t given a series of noisy observations y_t.  In this case, the system is linear Gaussian, so this problem could actually be (better) solved using a Kalman filter.  That’s okay – we’re just using this as an example problem – it would be very easy to change this to something more interesting – and anyway, it’s kind of handy to have a reference method to compare the results to.
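For this model, the bootstrap weighting step is particularly simple: the unnormalized weight of a particle with state x_t is just the Gaussian observation likelihood,

w_t \propto \exp( -(y_t - x_t)^2 / (2\sigma_{obs}^2) ),

which is the quantity the Mapper below computes (in log form).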

Note that, as with all practical implementations of particle filters, the weights are stored as log-weights to avoid numerical underflow while they are unnormalized.

Mapper

The Java code for the Mapper is sketched below (a minimal version consistent with the points that follow):
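(The class name PFMapper, the configuration keys alpha, sigmaS, sigmaObs, observation and isPrior, and the zero-mean Gaussian prior on the initial state are illustrative choices.)

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class PFMapper extends Mapper<LongWritable, Text, Text, Particle> {

    private static final Text OUT_KEY = new Text("particle");
    private final Random rng = new Random();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {

        Configuration conf = context.getConfiguration();
        boolean isPrior = conf.getBoolean("isPrior", false);
        double alpha    = Double.parseDouble(conf.get("alpha"));
        double sigmaS   = Double.parseDouble(conf.get("sigmaS"));
        double sigmaObs = Double.parseDouble(conf.get("sigmaObs"));
        double y        = Double.parseDouble(conf.get("observation"));

        // Propose a new state: from the (assumed zero-mean Gaussian) prior on the
        // first stage, otherwise by pushing the previous particle's state through
        // the AR(1) transition.  The incoming key (a byte offset) is ignored.
        double x;
        if (isPrior) {
            x = sigmaS * rng.nextGaussian();
        } else {
            // Previous-stage output lines look like "particle <tab> <state value>";
            // take the last whitespace-separated token as xprev.
            String[] tokens = value.toString().trim().split("\\s+");
            double xprev = Double.parseDouble(tokens[tokens.length - 1]);
            x = alpha * xprev + sigmaS * rng.nextGaussian();
        }

        // Unnormalized log-weight: the Gaussian observation log-likelihood.
        double logWeight = -0.5 * Math.log(2.0 * Math.PI * sigmaObs * sigmaObs)
                - (y - x) * (y - x) / (2.0 * sigmaObs * sigmaObs);

        // All particles share the key "particle", so they all reach the same reducer.
        context.write(OUT_KEY, new Particle(x, logWeight));
    }
}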

Mappers like this one must extend the base class Mapper, overriding the map(…) method.

Input to the mapper takes the form of a key-value pair.  The type of the key is given by the first generic type of the class (i.e. the first class name inside the angle brackets), here LongWritable, and the type of the incoming values is given by the second generic type (here Text).

Output is similarly a key-value pair, with key type specified by the third generic type (here Text) and value type specified by the fourth generic type (here Particle, see below).  As shown here, a custom type (Particle) can be used as output, as long as it implements the Writable interface.

Points to note about this mapper are:

  • Its input is a set of evenly-weighted particles from the previous stage (we assume resampling has happened at the previous stage, yielding a collection of evenly-weighted particles as its output).  These are described by their (scalar) positions.
  • These scalar positions (xprev) are encoded in Text objects, which are the values of the incoming data; these Text objects are parsed to retrieve the incoming particle’s state value.
  • The keys of the incoming data are ignored in this mapper, since all particles are treated the same.
  • It has a special sampling procedure for the first particle filter stage, where it samples from the prior on the initial state rather than from the transition model.
  • Parameters (including the “isPrior” indicator, which tells it whether this is the first stage or not) are passed as part of the configuration, which can contain named properties (see the Driver section for how these are set).  The configuration is passed to all mappers as part of the Context object they receive as an argument (called context here), so it is a good way to transmit (smallish) shared parameters to them all.
  • The output is written by writing a key-value pair of the correct types to the Context object that is passed to the function.
  • In this mapper, all particles are given the same key, the Text object “particle”.  This key is used to determine what happens to the mapper output at the reduce stage.  Values with the same key get sent to the same reducer, so here all values will be sent to the same reducer.  This is what we want, since we need to add up all the unnormalized particle weights in order to normalize them.  In many applications, on the other hand, you want to find things out about several categories of outputs, so several different keys might be used.  That situation is actually preferable, because it allows multiple reducers to run on different nodes, increasing the parallelization benefit.

The class Particle used here is a simple custom particle type; a minimal version is sketched below:
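(Field names are illustrative; the class just holds a state value and a log-weight.)

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

public class Particle implements Writable {

    private double x;          // particle state value
    private double logWeight;  // unnormalized log-weight

    public Particle() {}       // Hadoop needs a no-argument constructor

    public Particle(double x, double logWeight) {
        this.x = x;
        this.logWeight = logWeight;
    }

    public double getX()         { return x; }
    public double getLogWeight() { return logWeight; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeDouble(x);
        out.writeDouble(logWeight);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        x = in.readDouble();
        logWeight = in.readDouble();
    }
}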

Note that Particle implements the Writable interface, which means that it has readFields(…) and write(…) methods. These allow Hadoop to read and write the Particle object internally, so that it can be passed around, e.g. from Mappers to Reducers.  What they do not do is allow the particle to be written out to a file.  This will be discussed in the next article in this series.

Reducer

The code for the reducer is sketched below (a minimal version consistent with the points that follow):
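(The class name PFReducer is illustrative, and this particular stratified-resampling loop is one straightforward way of implementing the scheme described below.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PFReducer extends Reducer<Text, Particle, Text, DoubleWritable> {

    private final Random rng = new Random();

    @Override
    protected void reduce(Text key, Iterable<Particle> values, Context context)
            throws IOException, InterruptedException {

        // Copy the particle fields: Hadoop reuses the object handed out by the iterator.
        List<Double> states = new ArrayList<>();
        List<Double> logWeights = new ArrayList<>();
        double maxLogW = Double.NEGATIVE_INFINITY;
        for (Particle p : values) {
            states.add(p.getX());
            logWeights.add(p.getLogWeight());
            maxLogW = Math.max(maxLogW, p.getLogWeight());
        }

        // Normalize: subtract the maximum log-weight before exponentiating,
        // so at least one weight is non-zero after exponentiation.
        int n = states.size();
        double[] w = new double[n];
        double sum = 0.0;
        for (int i = 0; i < n; i++) {
            w[i] = Math.exp(logWeights.get(i) - maxLogW);
            sum += w[i];
        }
        for (int i = 0; i < n; i++) {
            w[i] /= sum;
        }

        // Stratified resampling: one uniform draw per stratum [i/n, (i+1)/n),
        // walked through the cumulative weights in a single pass.
        double cumulative = w[0];
        int j = 0;
        for (int i = 0; i < n; i++) {
            double u = (i + rng.nextDouble()) / n;
            while (u > cumulative && j < n - 1) {
                j++;
                cumulative += w[j];
            }
            // Output the resampled (now evenly-weighted) particle's state value.
            context.write(key, new DoubleWritable(states.get(j)));
        }
    }
}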

Reducers must extend the base class Reducer and override the reduce(…) method.  As with the Mapper, the generic types taken by the class (inside the angle brackets in its definition) indicate the key and value types of the reducer’s input and output.  The input keys are Text objects and their values are Particles – these should match the output of the Mapper that is creating them.  This reducer is going to output key-value pairs with keys of type Text and values of type DoubleWritable.

Unlike the Mapper, a single call to the Reducer receives multiple values: all of the Mapper outputs that share the same key are grouped together and sent to it.

Things to note about this reducer:

  • Weight normalization is performed by first subtracting the maximum log weight of any particle from all particles (equivalent to dividing by the maximum weight), and then normalizing the resulting weights.  This ensures that at least one value will have a non-zero weight once the unnormalized log-weights are exponentiated, avoiding the problem of numerical underflow if all unnormalized weights are very small.
  • The resampling scheme is a stratified resampling scheme, which is slightly quicker to run than a multinomial resampling scheme, because it avoids having to search through the set of particle weights each time a draw is made.  This might be useful for large numbers of particles.
  • The output is a series of DoubleWritables, which are output by writing them to the context (as in the Mapper).  Here, these DoubleWritables indicate the state value of the particles in the particle collection after resampling.  Because of the resampling type used here, they are all evenly weighted and so we don’t need to output their weights.

Driver

The final part of a Hadoop Map-Reduce program is the Driver, which tells Hadoop how everything hangs together and provides an entry point for the program.  The particle filter driver (PFDriver) is sketched below (a minimal version consistent with the points that follow):
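(The stage directory names, configuration keys and the use of Job.getInstance(…) – the Hadoop 2 style; older code uses new Job(conf) – are illustrative choices.)

import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PFDriver {

    public static void main(String[] args) throws Exception {
        String obsFile = args[0];                      // location of the observations
        int numParticles = Integer.parseInt(args[1]);  // number of particles

        List<Double> observations = ReadObservations(obsFile);
        WriteStageZeroInput("stage0/input.txt", numParticles);

        // Shared model parameters, visible to every Mapper via the Configuration.
        Configuration conf = new Configuration();
        conf.set("alpha", "0.9");
        conf.set("sigmaS", "1.0");
        conf.set("sigmaObs", "1.0");

        String input = "stage0";
        for (int t = 0; t < observations.size(); t++) {
            // Stage-specific parameters: first-stage flag and the current observation.
            conf.setBoolean("isPrior", t == 0);
            conf.set("observation", Double.toString(observations.get(t)));

            Job job = Job.getInstance(conf, "particle-filter-stage-" + t);
            job.setJarByClass(PFDriver.class);
            job.setMapperClass(PFMapper.class);
            job.setReducerClass(PFReducer.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Particle.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);

            String output = "stage" + (t + 1);
            FileInputFormat.addInputPath(job, new Path(input));
            FileOutputFormat.setOutputPath(job, new Path(output));

            // Wait for this stage to finish before starting the next one.
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }

            // The resampled particles from this stage feed the next stage's mappers.
            input = output + "/part-r-00000";
        }
    }

    // ReadObservations(...) and WriteStageZeroInput(...) are given in the appendix.
}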

Things to note about this driver are:

  • The location of the observations and the number of particles to use are set by the command line arguments args[0] and args[1], respectively.
  • ReadObservations(…) is a helper function, which reads the observations from the filesystem (Code is given at the end of the article for this method).  For large datasets, this data would likely be read one observation at a time from the filesystem, but for simplicity here, it is all read in in one go at the start.
  • WriteStageZeroInput(…) is a helper function which generates input for the first stage.  Because the number of particles is set by the number of Mappers that are spun up in each stage, which is in turn determined by the number of input datapoints they get, it seemed like the easiest way to set the number of particles at the start was to generate some dummy input (which is discarded) with the correct number of datapoints.  (Code is given at the end of the article for this method).
  • Parameters (used by the Mapper) are set by setting named values in a configuration, which is part of the Context object that will be passed to each Mapper and Reducer.
  • The driver then loops over each observation, setting up and running a Map-Reduce job at each iteration of the for loop, and making sure the input of one stage is the output of the previous stage.
  • For each stage (i.e. after each observation), the driver sets up parameters specific to that stage (whether this is the first stage or not) and the current observation.
  • A new job is set up for the stage, by setting the configuration, driver, mapper and reducer, along with the type of the output keys and values.
  • Finally, the input and output for the job are set up.  Here FileInputFormat and FileOutputFormat are used to specify the input and output paths.  With the default text input and output formats, input files are read in line by line (each line being passed to a map call), and output is written as text files (called “part-r-xxxxx”, where xxxxx is the number of the reducer whose output the file contains) with each key-value pair on a new line.  These are simple defaults; other input and output formats are available, and, for maximum control over input and output, custom record readers and writers can also be defined.  The creation of multiple outputs is covered in the next article.
  • The file name “part-r-00000” can be used here because the job runs a single reducer (the default of one reduce task is used, and all the mapper output shares a single key, see above), so this file is the one that contains the resampled particles.
  • The very last line tells the system to wait for the job to complete before continuing, which is necessary because we want the data from one stage available before we start the next.

Results

The output from each stage is a folder containing a part-r-00000 file.  This file contains the state value of each post-resampling particle, one per line.  We therefore expect it to contain duplicate entries corresponding to the more heavily-weighted particles.

[Figure: output of the basic particle filter, run over 100 stages with 1000 particles.]

The results of running the Hadoop Map-Reduce particle filter over 100 stages with 1000 particles are shown in the figure.  There is a fair bit going on in this figure, but the output of the particle filter is shown as grey shading with a white line running through the middle of it.  This reflects the mean of the particle collection (and thus the PF’s estimate of E[x], the expected state value at each stage) plus/minus two standard deviations, which gives an idea of the uncertainty in the estimate.  Since this is a linear Gaussian problem, we can also use a Kalman filter to do this estimation (exactly), and the red lines show its value of E[x] (solid red line) plus/minus two standard deviations (dashed red lines).  As you can see, the particle filter estimates match up closely with the corresponding Kalman filter values.  The blue line (and stars) shows the true value of the hidden state x at each stage.  As you can see, this is quite closely approximated by the estimate of E[x] and, in all cases, falls within the two-standard-deviation band.  Finally, the black dust at each stage shows the actual locations of the (evenly weighted) particles, each represented by a single dot.  The parameters of the process here are \alpha = 0.9, \sigma_{s} = 1, \sigma_{obs} = 1, and we give the filter the correct values of these parameters.

So, there you have it.  A functional Hadoop Map-Reduce particle filter.  In subsequent articles we will look at how to extend it to output multiple files and how to get it to do online parameter estimation for the autoregression parameter, using online EM.

One gotcha when running this program: it won’t run if the output subdirectory already exists, so you have to delete that each time you want to run it.  Not too much of a problem but the error message is slightly cryptic.

Refinements

In the algorithm outlined here, the reducer is doing a lot of work.  Furthermore, there can be only one reducer, so we are really forcing all this computation to take place on a single node.  If we have a situation where we want to deal with a million or a billion particles, this might be overwhelming.

There’s no getting around the fact that, at some point, a sum has to be calculated over all particles in order to normalize the particle weights.  So that is always going to be a potential bottleneck if you have a lot of particles.  However, only the weights need to be transmitted for this, and local combiners could sum up the weights of all the particles handled on each node, meaning only one value per node ever needs to be transmitted.  So even for the large-scale case with, say, a million particles on 1000 nodes, this shouldn’t be too onerous.

Resampling is an interesting question.  If a single reducer is used as in the above model, this could create a bottleneck when moving to, say, a million particles, since one node would have to do all the sampling for a million particles.  It would be much nicer to split this up amongst the nodes.  But is splitting up the resampling task possible?  Well, I think so.

The purpose of resampling is to create an equivalent delta-mass based posterior representation but with lower weight variance than the one you start with.  This avoids the (inevitable) problem of all the mass concentrating on a single particle after a few stages if resampling is not used.  What we would like to do is to break up this resampling task into a set of smaller tasks that can be distributed across the nodes.

One possible scheme (I think, although I have not verified this) is to break up the particles into subsets, and to locally resample each subset.  Each subset i will produce K_i offspring particles, with (equal) new weights set so as to give a total new weight for the subset equal to its incoming weight.  The split into subsets can be arbitrary, but should take into account the number of computational nodes available, allowing resampling to occupy all the nodes.  The number of output particles K_i for subset i should be based on a target number of overall particles N, and then set as K_i = ceil(W_i N), where W_i is the total weight of subset i.  Because each subset can round up by at most one particle, this results in between N and N+M-1 output particles, where M is the number of subsets.  This resampling method will not give quite such low weight variance as standard resampling, but it can be parallelized into an arbitrary number of subsets M<=N.
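As a made-up numerical illustration: with a target of N = 1000 particles and M = 4 subsets whose total weights happen to be 0.3071, 0.2532, 0.2269 and 0.2128, the subsets would produce ceil(307.1) + ceil(253.2) + ceil(226.9) + ceil(212.8) = 308 + 254 + 227 + 213 = 1002 particles, which is within the N + M - 1 = 1003 bound.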

Within a Map-Reduce context, implementing this scheme will require a second Map-Reduce stage.  The first Mapper will be as here, but now the first Reducer will simply calculate the weight sum.  The second Mapper stage will then normalize the particle weights and output particles with normalized weights that are randomly assigned to different groups (e.g. by setting their keys to one of M values).  The second-stage Reducers (of which M will be created) will then perform resampling for each of the M subsets and output their (locally) resampled particles.

Conclusion

This has been a bit of a mammoth article, but we now have a working, if basic, Hadoop Map-Reduce particle filter!  In the next article we will explore how this can be extended by allowing it to create multiple output files.

Appendix – Ancillary Code

The PFDriver.ReadObservations(…) and PFDriver.WriteStageZeroInput(…) ancillary functions are given below.  They are static methods of the PFDriver class.  Note that these work with the local filesystem, not the HDFS and so will only work with Hadoop in its LocalJobRunner mode (see previous article).  They are intended merely as ancillary functions, so don’t make any claim to robustness or quality.

First, ReadObservations(…), which opens a text file and reads each line as an observation:
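A simple version (a static method of PFDriver, using plain local-filesystem I/O as noted above; the exact parsing is illustrative) might look like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Static method of PFDriver: read one observation per line from a local text file.
static List<Double> ReadObservations(String path) throws IOException {
    List<Double> observations = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
        String line;
        while ((line = reader.readLine()) != null) {
            line = line.trim();
            if (!line.isEmpty()) {
                observations.add(Double.parseDouble(line));
            }
        }
    }
    return observations;
}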

And then PFDriver.WriteStageZeroInput(…), which opens a (specific) file and writes n lines of dummy data so that the particle filter starts off with n particles:
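A simple version (again a static method of PFDriver; the dummy line content is arbitrary, since the first-stage mapper ignores it):

import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;

// Static method of PFDriver: write n dummy lines so that the first stage
// processes n particles (one map call per input line).
static void WriteStageZeroInput(String path, int n) throws IOException {
    File file = new File(path);
    if (file.getParentFile() != null) {
        file.getParentFile().mkdirs();   // make sure the stage-zero directory exists
    }
    try (PrintWriter writer = new PrintWriter(new FileWriter(file))) {
        for (int i = 0; i < n; i++) {
            writer.println("0.0");       // dummy content, discarded by the mapper
        }
    }
}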